Introduction
The widespread use of databases and the rapid growth in the volume of data they store create both a problem and a new opportunity for credit companies. These companies are realizing the need to make efficient use of the information stored in their databases by extracting useful knowledge to support their decision-making process.
Knowledge is now widely regarded as one of the most valuable assets a company or nation may have. Several companies are investing large sums of money in the development of new computational tools able to extract meaningful knowledge from large volumes of data collected over many years. Among them, companies working with credit risk analysis have invested heavily in sophisticated computational tools to perform efficient data mining in their databases.
The behavior of the financial market is affected by a large number of political, economic, and psychological factors, which are correlated and interact among themselves in a complex way. The majority of these relations seem to be probabilistic and non-linear. Thus, these relations are hard to express through deterministic rules.
Simon (1960) classifies financial management decisions along a continuum whose limits are non-structured and highly structured. The highly structured decisions are those where the processes necessary for the achievement of a good solution are known beforehand and several computational tools to support the decisions are available. For non-structured decisions, only the managers' intuition and experience are used. Specialists may support these managers, but the final decisions involve a substantial amount of subjective elements. Highly non-structured problems are not easily adapted to conventional computer-based analysis methods or decision support systems (Hawley, Johnson, & Raina, 1996).
background
The extraction of useful knowledge from large databases is named knowledge discovery in databases (KDD). KDD is a very demanding task and requires the use of sophisticated computing techniques (Brachman & Anand, 1996; Fayyad, Piatetsky-Shapiro, Amith, & Smyth, 1996). The recent advances in hardware and software make possible the development of new computing tools to support such a task. According to Fayyad et al. (1996), KDD comprises a sequence of stages, including:
• Understanding the application domain,
• Selection,
• Pre-processing,
• Transformation,
• Data mining, and
• Interpretation/evaluation.
It is also important to stress the difference between KDD and data mining (DM). While KDD denotes the whole process of knowledge discovery, DM is one component of this process: the extraction of patterns or models from observed data. KDD can be understood as a process that contains the previously listed steps. Although DM is at the core of the knowledge discovery process, it usually takes only a small part (estimated at 15-25%) of the overall effort (Brachman & Anand, 1996).
The KDD process begins with the understanding of the application domain, considering aspects such as the objectives of the application and the data sources. Next, a representative sample, selected according to statistical techniques, is drawn from the database, preprocessed, and submitted to the methods and tools of the DM stage with the objective of finding patterns/models (knowledge) in the data. This knowledge is then evaluated regarding its quality and/or usefulness, so that it can be used to support a decision-making process.
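The stages described above can be sketched as a chain of small functions. This is only an illustrative sketch under stated assumptions: the field names (`income`, `defaulted`), the income threshold, and the "pattern" being mined (a difference in default rates between two income groups) are hypothetical and are not taken from the cited works.

```python
import random

def select_sample(records, fraction, seed=0):
    """Selection: draw a simple random sample of the records."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def preprocess(records):
    """Pre-processing: drop records with a missing income value."""
    return [r for r in records if r["income"] is not None]

def mine(records, threshold):
    """Data mining: estimate the default rate below and above an income threshold."""
    groups = {"low_income": [], "high_income": []}
    for r in records:
        key = "low_income" if r["income"] < threshold else "high_income"
        groups[key].append(r["defaulted"])
    # Empty groups are skipped to avoid dividing by zero.
    return {k: sum(v) / len(v) for k, v in groups.items() if v}

def evaluate(pattern, min_gap=0.1):
    """Interpretation/evaluation: keep the pattern only if the two rates differ enough."""
    rates = list(pattern.values())
    return len(rates) == 2 and abs(rates[0] - rates[1]) >= min_gap

# Toy data (hypothetical): 1 means the applicant defaulted.
records = [
    {"income": 1000, "defaulted": 1},
    {"income": 1500, "defaulted": 1},
    {"income": 1200, "defaulted": 0},
    {"income": None, "defaulted": 0},
    {"income": 5000, "defaulted": 0},
    {"income": 6000, "defaulted": 0},
]

sample = select_sample(records, fraction=1.0)
pattern = mine(preprocess(sample), threshold=3000)
useful = evaluate(pattern)
```

In a real application each stage would be far more elaborate, but the ordering (selection, pre-processing, mining, evaluation) follows the process described in the text.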
Frequently, DM tools are applied to unstructured databases, where the data can, for example, be extracted from texts. In these situations, specific pre-processing techniques must be used in order to extract information in the attribute-value format from the original texts.
credit risk assessment
Credit risk assessment is concerned with the evaluation of the profit and guaranty of a credit application. According to Dong (2006), the main approaches proposed in the literature for credit assessment can be divided into two groups: default models and credit scoring models. While default models assess the likelihood of default, credit scoring models assess the credit quality of the credit taker. This text covers credit scoring models.
A typical credit risk assessment database is composed of several thousand credit applications. These credit applications can be related to either companies or people. Examples of personal credit applications are student loans, personal loans, credit card concessions, and home mortgages. Examples of company credits are loans, stocks, and bonds (Ross, Westerfield, & Jaffe, 1993).
Usually, the higher the value of the credit asked, the more rigorous is the credit risk assessment. Large financial institutions usually have whole departments dedicated to this problem.
The traditional approach employed by bank managers largely depends on their previous experience and does not follow the procedures defined by their institutions. Besides, several deficiencies in the datasets available for credit risk assessment, together with the high volume of data currently available, make manual analysis almost impossible. The size of these databases exceeds the human capability of understanding and efficiently dealing with them, creating the need for a new generation of computational tools and techniques to perform automatic and intelligent analysis of large databases.
In 2004, the Basel Committee on Banking Supervision published a new capital measurement system, known as the New Basel Capital Accord, or Basel II, which implements a new credit risk assessment framework that supports the estimation of the minimum regulatory capital that should be allocated for the compensation of possible default loans or obligations (Basel, 2004; Van Gestel et al., 2006). The Basel Committee was created in 1974 by the central banks of 10 countries. In 1988, the committee introduced a capital measurement system commonly referred to as the Basel Capital Accord. The first accord established general guidelines for credit risk assessment. The new accord, Basel II, encourages financial institutions to adopt customized risk rating systems based on their credit transaction databases. As a consequence, DM techniques assume a very important role in credit risk assessment. They will allow the replacement of general risk assessment by careful analysis of each loan commitment.
Credit analysis databases usually cover a huge number of transactions performed over several years. The analysis of these data may lead to a better understanding of the customer's profile, thus supporting the offer of new products or services. These data usually hold valuable information, for example, trends and patterns, which can be employed to improve credit assessment. The large amount of data makes manual analysis an impossible task. In many cases, several related features need to be simultaneously considered in order to accurately model credit user behavior. This need for automatic extraction of useful knowledge from a large amount of data is widely recognized.
using data mining for credit risk assessment
DM techniques are employed to discover strategic information hidden in large databases. Before they are explored, these databases are cleaned. Next, a representative set of samples is selected. Machine learning techniques are then applied to these selected samples. The use of data mining techniques on a credit risk analysis database allows the extraction of several relevant pieces of information regarding credit card transactions.
The data present in a database must be adequately prepared before data mining techniques can be applied to it. The main steps employed for data preparation are:
• Preprocessing of the data to the format specified by the algorithms to be used;
• Reduction of the number of samples/instances;
• Reduction of the number of features/attributes;
• Feature construction, which is the combination of one or more attributes in order to transform irrelevant attributes into more significant attributes; and
• Noise elimination and treatment of missing values.
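A few of the preparation steps listed above can be illustrated with a short sketch. The attribute names (`age`, `income`, `debt`), the plausible age range, and the choice of mean imputation are assumptions made for the example. Other treatments of noise and missing values are equally possible.

```python
from statistics import mean

def drop_noise(rows, field, low, high):
    """Noise elimination: discard rows whose value lies outside [low, high]."""
    return [r for r in rows if low <= r[field] <= high]

def impute_missing(rows, field):
    """Missing values: replace None with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_ratio(rows, numerator, denominator, name):
    """Feature construction: combine two attributes into one ratio attribute.

    Assumes the denominator is never zero in the prepared data.
    """
    return [{**r, name: r[numerator] / r[denominator]} for r in rows]

# Hypothetical credit applications.
applications = [
    {"age": 35, "income": 4000.0, "debt": 1000.0},
    {"age": 41, "income": None, "debt": 500.0},      # missing income
    {"age": 230, "income": 3000.0, "debt": 900.0},   # implausible age, treated as noise
    {"age": 28, "income": 2000.0, "debt": 1000.0},
]

prepared = drop_noise(applications, "age", 18, 100)
prepared = impute_missing(prepared, "income")
prepared = add_ratio(prepared, "debt", "income", "debt_to_income")
```

Noise is removed before imputation here so that the implausible record does not influence the mean used to fill the missing income.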
Once the data have been pre-processed, machine learning (ML) techniques can be employed to discover useful knowledge. The quality of a knowledge extraction technique can be evaluated by different measures, such as accuracy; comprehensibility; and new, useful knowledge.
The application of data mining techniques for credit risk analysis may provide important information that can improve the understanding of the current credit market and support the work of credit analysts (Carvalho, Braga, Rezende, Ludermir, & Martineli, 2002; Eberlein, Breckling, & Kokic, 2000; Horst, Padilha, Rocha, Rezende, & Carvalho, 1998; Lacerda, de Carvalho, Braga, & Ludermir, 2005; Dong, 2006; Huang, Hung, & Jiau, 2006).
Credit analysis can be seen as a pattern classification task that involves the evaluation of the reliability and profitability of a credit application. If the labels associated with the data, characterizing each customer as either a good or a bad customer, are available, customers can be classified into these two classes. This is a binary classification problem. However, a credit scoring system does not need to be restricted to two classes. Financial institutions may have different categories for the customers, according to the provision required, for example. These categories usually form a ranking, with the best customers close to the top. Current economic indexes and changes in the customer profile may change his or her position in the ranking. In this case, a multiclass classification system should be used, preferably one that allows ranking classification.
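One simple way to obtain ranked categories is to map an estimated default probability onto ordered grades. The sketch below assumes a model that already outputs such a probability. The grade labels and cutoff values are hypothetical and would in practice be set by each institution.

```python
from bisect import bisect_right

# Hypothetical grades, ordered from best to worst customer category.
GRADES = ["A", "B", "C", "D"]
# Hypothetical cutoffs: a probability below CUTOFFS[k] falls into GRADES[k];
# anything at or above the last cutoff falls into the worst grade.
CUTOFFS = [0.02, 0.10, 0.30]

def grade(p_default):
    """Map an estimated probability of default to an ordered credit grade."""
    return GRADES[bisect_right(CUTOFFS, p_default)]

ranking = [grade(p) for p in (0.01, 0.05, 0.15, 0.50)]
```

If a customer's estimated probability changes, for instance after a shift in economic indexes, calling `grade` again moves the customer to the corresponding position in the ranking.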
In credit risk assessment, different misclassifications have different costs. However, most existing data mining algorithms assume that the goal is to minimize the number of misclassification errors. Whenever different errors have different costs, several authors (Breiman, Friedman, Olshen, & Stone, 1984; Turney, 1995) propose to minimize the conditional risk, which is the expected cost of predicting that $x$ belongs to Class $i$. The conditional risk is defined as:
$$\mathrm{Risk}(\mathrm{Class}_i \mid x) = \sum_j P(\mathrm{Class}_j \mid x)\, C(i,j),$$
where $C(i,j)$ is the cost of predicting class $i$ when the true class is $j$.
Different ML techniques have been used for credit risk assessment datasets. One of the studies found in the literature (Horst et al., 1998) compares the performance of neural networks and decision trees for credit risk assessment. Alternative approaches for the induction of Bayesian classifiers applied to credit scoring are investigated in Baesens, Egmont-Petersen, Castelo, and Vanthienen (2002). Dong (2006) studies the influence of the distance metric on the performance of a case-based reasoning system used for credit scoring. A model based on support vector machines following the Basel II accord is proposed in Van Gestel et al. (2006). Another work using support vector machines highlights the importance of input attribute selection (Yu, Lai, & Wang, 2006). The problem of imbalanced distribution of examples among the classes and its influence on different ML techniques is investigated in Huang et al. (2006).
Several of the most recent approaches are based on hybrid intelligent systems (HISs). In Mendes Filho, de Carvalho, and Matias (1997), multi-layer perceptron neural networks are designed by genetic algorithms for credit assessment. Fuzzy and neurofuzzy approaches for credit scoring are compared in Hoffmann, Baesens, Martens, Put, and Vanthienen (2002). Lacerda et al. (2005) show how radial basis function neural networks tuned for credit assessment problems can be evolved by genetic algorithms. Two approaches for learning fuzzy rules for credit scoring by using evolutionary algorithms are investigated in Hoffmann, Baesens, Mues, Van Gestel, and Vanthienen (2007). In another hybrid approach, a classifier based on multi-criteria linear programming is combined with independent analysis (Li, Shi, Zhu, & Dai, 2006).
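A small worked example may help to show why minimizing conditional risk can differ from choosing the most probable class. The sketch computes, for each candidate prediction i, the sum over the true classes j of P(Class j | x) times C(i, j), and picks the prediction with the lowest value. The posterior probabilities and the cost matrix are invented for illustration.

```python
def conditional_risk(posteriors, cost, i):
    """Expected cost of predicting class i: sum over j of P(class j | x) * C(i, j)."""
    return sum(p_j * cost[i][j] for j, p_j in enumerate(posteriors))

def min_risk_class(posteriors, cost):
    """Return the index of the lowest-risk prediction and the list of all risks."""
    risks = [conditional_risk(posteriors, cost, i) for i in range(len(cost))]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Classes: 0 = good customer, 1 = bad customer.
# cost[i][j] is the cost of predicting class i when the true class is j.
# Assumption for illustration: granting credit to a bad customer costs 5,
# refusing a good customer costs 1, and correct decisions cost nothing.
cost = [
    [0.0, 5.0],  # predict good
    [1.0, 0.0],  # predict bad
]

# Hypothetical posterior probabilities P(good | x), P(bad | x) for one applicant.
posteriors = [0.8, 0.2]

best, risks = min_risk_class(posteriors, cost)
```

With these numbers the risk of predicting "good" is 0.2 * 5 = 1.0 and the risk of predicting "bad" is 0.8 * 1 = 0.8, so the minimum-risk decision is "bad" even though "good" is the more probable class.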
critical issues of credit analysis in data mining
There are, of course, a few problems that can arise from the use of data mining techniques for credit risk analysis. The main advantages and problems associated with this technology are briefly presented in Table 1. Current Internet technology may allow the invasion of company databases, which may lead to the access of confidential data. Besides, the stored data may be made available to other companies or used without the knowledge or authorization of the customer. On the other hand, the extraction of useful information from credit risk analysis databases should result in benefits not only to the companies, but also to the customers, with the offering of better and more convenient credit products.
Table 1. A summary of critical issues of credit risk analysis data mining
Data Mining: Extraction of useful information from large databases | Privacy & Confidentiality Agreements: Addressing the individual right to privacy and the sharing of confidential information |
Concept Drifting: When statistical properties of the target concept change over time | Training of Employees: Credit analysts should be trained with the new tools available and shown the benefits of the new technologies |
Inadequate Information: Applicants may have supplied incorrect information, intentionally or not | User Ignorance and Perceptions: Lack of adequate understanding of data mining and its usefulness |
Maintaining Integrity of Data: Maintaining up-to-date and accurate information in the databases | Security: Maintaining secure and safe systems and keeping unauthorized users out |
future trends
A key issue in credit risk assessment is the non-stationary properties of the problem: interests change over time. In this context, methods that take drift into account can improve the generalization capability of the learning algorithms by adaptation of decision models to the most recent data (Gama, Medas, Castillo, & Rodrigues, 2004).
Credit-related data are being produced at growing rates. If the process is not strictly stationary (as in most real-world applications), the target concept could gradually change over time. For example, the class profiles for good and bad clients may rapidly change. The ability to incorporate this concept drift is a natural extension for future DM systems. In real-time credit assessment, large chunks of data will be collected over time. Future credit assessment DM tools will have to deal with possible modifications in the existing classes. A natural approach for this new situation is the use of adaptive learning algorithms, where incremental learning algorithms take into account such concept drift.
Future work in this area should also take into consideration security issues, support for ubiquitous computing, and a better integration with current database technology.
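One of the simplest ways to adapt a model to recent data is to learn only from a sliding window of the latest examples. The sketch below is a minimal illustration under that assumption, not the method of the cited authors. The single feature (a debt-to-income ratio), the window size, and the nearest-mean rule are all hypothetical choices.

```python
from collections import deque

class SlidingWindowClassifier:
    """Nearest-mean classifier that only remembers the most recent examples."""

    def __init__(self, window):
        # Older (feature, label) pairs are discarded automatically when full.
        self.window = deque(maxlen=window)

    def learn(self, x, label):
        """Incremental update: add one labeled example to the window."""
        self.window.append((x, label))

    def predict(self, x):
        """Assign x to the class whose mean feature value in the window is closest.

        Raises ValueError if no example has been learned yet.
        """
        means = {}
        for label in {lbl for _, lbl in self.window}:
            values = [v for v, lbl in self.window if lbl == label]
            means[label] = sum(values) / len(values)
        return min(means, key=lambda lbl: abs(x - means[lbl]))

model = SlidingWindowClassifier(window=4)

# Before the drift: bad clients have a high debt-to-income ratio.
for x, label in [(0.2, "good"), (0.3, "good"), (0.8, "bad"), (0.9, "bad")]:
    model.learn(x, label)
before = model.predict(0.5)

# After the drift: the profile of bad clients has moved to lower ratios.
for x, label in [(0.1, "good"), (0.15, "good"), (0.4, "bad"), (0.5, "bad")]:
    model.learn(x, label)
after = model.predict(0.5)
```

The same applicant (ratio 0.5) is classified as "good" before the drift and as "bad" after it, because the window has forgotten the older examples. Choosing the window size is a trade-off: a short window adapts quickly but is noisier, while a long window is more stable but slower to react.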
conclusion
Several companies perform data mining on personal data stored in their databases. This is particularly true for credit risk analysis databases, where customers' personal data can be employed to improve the quality of credit assessment and support the offer of new credit products. Data mining systems usually employ artificial intelligence and statistical techniques to acquire relevant, interesting, and new knowledge for most applications, including credit risk analysis. Although one cannot avoid the use of these sophisticated techniques to support credit assessment, great care should be taken to guarantee privacy and that personal rights are not violated when working with private information.
A financial problem that shares several similarities with credit risk assessment is bankruptcy prediction. With the movement towards economic blocs, growing globalization demands robust and reliable systems for forecasting bank bankruptcies. This demand comes from different sources, such as managers, investors, and government organizations. DM has also been successfully used for bankruptcy prediction, as can be seen in Martineli, Diniz, de Carvalho, Rezende, and Matias (1999), Atiya (2001), and Chakraborty and Sharma (2007).
key terms
Consumer Credit: A loan to an individual to purchase goods and/or services for personal, family, or household use.
Credit: Delivery of a value in exchange for a promise that this value will be paid back in the future.
Credit Scoring: A numerical method of determining an applicant's loan suitability based on various credit factors such as types of established credit, credit ratings, residential and occupational stability, and ability to pay back the loan.
Data: The set of samples, facts, or cases in a data repository. As an example of a sample, consider the field values of a particular credit application in a bank database.
Data Mining: The process of extracting meaningful information from very large databases. One of the main steps of the KDD process.
KDD: Process of knowledge discovery in large databases.
Knowledge: Defined according to the domain, considering usefulness, originality, and understanding.
Machine Learning: Sub-area of artificial intelligence that includes techniques able to learn new concepts from a set of samples.