Regression Techniques for Credit Scoring: A Case Study for the Commercial Bank of Zimbabwe (Bulawayo)

Credit creation is the main income generating activity for banks. However, this activity involves huge risks to both the lender and the borrower. The risk of a trading partner not fulfilling his or her obligation as per the contract, on the due date or any time thereafter, can greatly jeopardise the smooth functioning of a bank's business. Credit risk is therefore one of the greatest concerns of most banking authorities and banking regulators. This paper aims to develop a model that the Commercial Bank of Zimbabwe can use to calculate the risk associated with credit scoring. The data set covered personal loans from January 2010 to January 2012. Linear and Buckley James regression were employed to find the explanatory variables influencing time to default and time to repayment. Linear discriminant analysis was applied to investigate customer classification. Age, marital status, loan purpose and time at current job were found to be linearly related to time to default. Time to repayment was found to be linearly related to age, marital status and loan purpose. 67.5% of the original cases were correctly classified. Buckley James regression outperformed linear regression and was therefore found to be the most suitable method for determining the variables affecting risk in loan lending.


Introduction
There is no instrument that can predict the future accurately, but when lending, banks try to predict the outcome of each loan. Banks have to assess the possibility of a customer defaulting on the loan. Like all debt instruments, a loan entails the redistribution of financial assets over time between the lender and the borrower. The borrower initially receives an amount of money from the lender, which he or she pays back to the lender, usually but not always in regular installments.
Credit scoring uses quantitative measures of the performance and characteristics of past loans to predict the future performance of loans with similar characteristics (Caire & Kossman, 2003). Credit scoring is a scientific method of assessing the credit risk associated with new credit applications. Statistical models derive predictive relationships between application information and the likelihood of satisfactory repayment. Models are empirically designed; that is, they are developed entirely from information gained through prior experience. Therefore, credit scoring is an objective risk assessment tool, as opposed to subjective methods that rely on a loan officer's opinion. Clearly, credit scoring is a risk management tool. Scoring systems can help a bank ensure more consistent underwriting and can provide management with a more insightful measure of credit risk.
Credit scoring cannot predict individual loan loss; rather, it predicts the likelihood or odds of a bad outcome, as defined by each bank. Usually this will be some level of average or total days in arrears at which the associated costs make the loans unprofitable. Nor should a credit scoring system alone approve or reject a loan application; rather, the underwriter must decide how he or she will incorporate the credit score into the loan review. Finally, credit scoring is not meant to increase approval rates; rather, it promotes consistency and efficiency while maintaining or reducing historic delinquency rates. It also allows users to focus their attention and time on applications that are not obvious approvals or obvious declines (Caire & Kossman, 2003). Hence the research aims to develop a censored regression model that can be used to calculate the risk associated with credit.
CBZ is a registered commercial bank in Zimbabwe, established in 1980, which offers a wide range of innovative banking and financial services to personal and corporate customers. Banks generally provide a variety of services that include, but are not limited to, cash and cheque deposits and withdrawals; provision of credit facilities such as loans, overdrafts and credit cards; processing payments; asset financing; mortgages; clearing; foreign exchange; money transfer; advisory services; safe keeping services; and custodial services (Ambira & Kemoni, 2011). CBZ has banking products which include savings accounts, current accounts, foreign currency accounts, fixed deposits, cash manager accounts, personal loans, private home and commercial loans, micro leasing, asset finance, agribusiness finance, micro finance loans, offshore credit, and business loans. The company also offers foreign currency services, trade finance, international banking, investment banking, small to medium enterprises financing, treasury management, wealth management, agribusiness, custodial services, and bancassurance. Credit scoring at CBZ is done using a credit score sheet, which is a standard document with specific attributes used when appraising a loan application. The score sheet is user friendly, as there are guidelines on every attribute. The credit risk faced by CBZ is that of the customer defaulting on their loan. Hence the research aims to develop a censored regression model that can be used to calculate the risk associated with credit.

Literature Review
Decisions on whom to grant credit, and on how much credit to grant, originally relied purely on the skill of a loans officer. The loans officer uses his experience and personal judgement and, guided by attributes that affect the creditworthiness of the applicant, makes a decision on whether or not to grant credit. The attributes deemed most important are referred to collectively as the five Cs of credit (Thomas et al., 2002). They are:
1). Character - The willingness to pay debt. For example, how long has the applicant been at their current job?
2). Capacity - The borrower's capacity to pay the debt. Wages and other income are major determinants here.
3). Collateral - Possessions that might be used to secure the debt are classed as collateral. For a mortgage, the home purchased is used as collateral.
4). Capital - A well-resourced individual is more likely to be granted a loan.
5). Conditions - Current and projected economic conditions are also taken into account.
A number of factors led to the introduction of automated credit scoring in the 1940s. According to Durand (1941), at the end of World War II there was an explosion in the demand for credit, and it became clear that the subjective methods did not scale well to large numbers of applicants. The credit explosion, spurred on by the introduction of credit cards a few decades later, motivated lenders to automate the credit granting decision, giving birth to objective credit scoring systems. In parallel with the growth of credit demand, increases in computing power made it possible to analyse large quantities of data with (relative) ease. More recently, the development of scoring systems has been driven by the regulatory environment. As part of the capital adequacy requirements placed upon banks with the introduction of the Second Basel Accord (Basel Committee for Banking Supervision, 2001), institutions are required to closely monitor the risks associated with their loan portfolios. Since the introduction of the first credit scoring systems, a number of statistical and mathematical methods have been used. Most techniques have a statistical background, such as Markov analysis, linear regression, logistic regression and the Buckley James method.
The credit score is determined by a complex formula that takes into account many different factors. Credit scoring models compute a person's score primarily from information contained in his credit report. The models might also take information from credit applications into consideration, including the person's age, time with bank (months), number of dependants, time at current address (months), time at current job (months), sex, refinancing of other financial institution's loan flag, self-employed flag, marriage status and purpose of loan. The person's payment history reflects the various accounts that he has, including credit cards, mortgage loans, and retail accounts. Collections, foreclosures, lawsuits, and other collection items also fall into this factor. Each factor is given a weight (Credit Risk Scoring Analytics, Issue No: 0710511).
Historically, a credit officer used information relating to the creditworthiness of an applicant to determine whether or not to grant a loan. Current credit scoring systems work in much the same way, although objectively. Assume that the customer population consists of two classes, good and bad. The information that a customer provides when applying for a loan is used by banks to determine which group the customer is likely to belong to. Rather than being examined in a subjective way, the information is coded to form quantitative variables that can be input into a statistical model. For an individual, if there are k explanatory variables, they are collected as a vector to form the input to the model. The explanatory variables can then be used to produce a score to estimate the probability, p, of that individual belonging to the good or bad class. The relationship between the explanatory variables and the probability of default is usually found by fitting to a historical set of completed loans, some of which are bad.
Credit scoring techniques were originally developed to help organisations automate the credit granting decision. As a result, the primary aim of a traditional credit scoring system is to classify potential customers as either good or bad so that the appropriate action can be taken. A bad customer may be deemed to be one who fails to repay the loan in full, but this definition can be expanded to cover a range of undesirable behaviour. Surveys by Rosenberg and Gleit (1994) and Hand and Henley (1997) outline the different modelling techniques that can build such systems.
The definition of bad can be somewhat arbitrary and is often driven by regulatory demands. While the definition can include early repayment, churn, or fraudulent activity, the most common definition of bad is default. Default could be taken as one missed payment, three consecutive missed payments, or perhaps the point when the debt becomes unrecoverable. If the definition of bad is too stringent, or not stringent enough, it may have a negative impact on the quality of the final scorecard (Siddiqi, 2005).

Methodology and Data
Assume that a body of loans data has been collected with n data points; perhaps it was collected to construct a classification scorecard. Typically, CBZ would use the data to predict the chance of a loan being in default at some cut-off date (Siddiqi, 2005). Now suppose that, instead of estimating the probability that the loan will go bad, we wish to estimate the time to default, $T_d$, and the time to repayment, $T_r$, where default and repayment are assumed to be independent events. As the observation period may end while the loan is still under way, there will be censoring at $t$ months, where $t$ is the length of the maximum observation period for the loan. The total observed time of the loan, $T$, is then

$$T = \min(T_d, T_r, t).$$

If the two events are assumed to be independent, then the overall hazard function for the loan can be expressed as

$$h = h_d + h_r,$$

where $h_d$ and $h_r$ are the individual hazards for each mode of failure. One implication of the independence assumption is that, when estimating time to default, early repayment is viewed as a censoring mechanism. Time to repayment can be modelled separately, with any defaults viewed as censored observations. Hence, for modelling default, the censoring indicator is given as

$$\delta_d = \begin{cases} 1 & \text{if } T = T_d \\ 0 & \text{otherwise} \end{cases}$$

and for modelling repayment

$$\delta_r = \begin{cases} 1 & \text{if } T = T_r \\ 0 & \text{otherwise.} \end{cases}$$

Figure 1 shows the different mechanisms acting when considering default and how they affect the coding of the data.
Figure 2 shows how the same events are coded differently for repayment. The hollow circle refers to a censored point.
LDA will be used to classify customers into classes, that is, either good or bad. A score, $Z$, will be constructed which is a linear function of the explanatory variables $x$,

$$Z = (m_G - m_B)^{\top} \Sigma^{-1} x,$$

where $m_G$ and $m_B$ are the vector group means for the good and bad classes respectively and $\Sigma$ is the common covariance matrix.
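As an illustration of how such a discriminant score can be computed, the following is a minimal sketch rather than the procedure used in the study. It assumes the standard Fisher form of the score, a pooled covariance estimate, equal prior probabilities for the cut-off, and synthetic data; the function name `lda_score_weights` and the two made-up explanatory variables are hypothetical.

```python
import numpy as np

def lda_score_weights(X_good, X_bad):
    """Return (w, cutoff) so that Z = x @ w and Z > cutoff suggests 'good'."""
    m_g = X_good.mean(axis=0)              # group mean, good class
    m_b = X_bad.mean(axis=0)               # group mean, bad class
    n_g, n_b = len(X_good), len(X_bad)
    # Pooled within-group covariance (the common covariance matrix).
    pooled = ((n_g - 1) * np.cov(X_good, rowvar=False)
              + (n_b - 1) * np.cov(X_bad, rowvar=False)) / (n_g + n_b - 2)
    w = np.linalg.solve(pooled, m_g - m_b)  # Sigma^{-1} (m_G - m_B)
    cutoff = 0.5 * w @ (m_g + m_b)          # midpoint rule, assumes equal priors
    return w, cutoff

# Synthetic data only: two invented explanatory variables per customer.
rng = np.random.default_rng(0)
X_good = rng.normal(loc=[40.0, 60.0], scale=5.0, size=(100, 2))
X_bad = rng.normal(loc=[30.0, 24.0], scale=5.0, size=(100, 2))

w, cutoff = lda_score_weights(X_good, X_bad)
Z = np.vstack([X_good, X_bad]) @ w
predicted_good = Z > cutoff
actual_good = np.r_[np.ones(100, bool), np.zeros(100, bool)]
accuracy = (predicted_good == actual_good).mean()
print(f"share correctly classified: {accuracy:.3f}")
```

The midpoint cut-off treats both kinds of misclassification as equally costly; a bank that regards accepting a bad customer as more expensive than rejecting a good one would shift the cut-off accordingly.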
Discriminant function analysis determines which continuous variables discriminate between two or more naturally occurring groups. In LDA, the explanatory variables are the predictors and the dependent variables are the groups. LDA is usually used to predict membership in naturally occurring groups. It answers the question: can a combination of variables be used to predict group membership? Several variables are included in this study to see which ones contribute to the discrimination between groups.
Discriminant function analysis will be broken into a 2-step process:
1). Testing significance of a set of discriminant functions.
This step is computationally identical to MANOVA. There is a matrix of total variances and covariances; likewise, there is a matrix of pooled within-group variances and covariances. The two matrices are compared via multivariate F tests in order to determine whether or not there are any significant differences (with regard to all variables) between groups. The multivariate test is performed first and, if it is statistically significant, the analysis proceeds to see which of the variables have significantly different means across the groups. Once the group means are found to differ significantly, classification of variables is undertaken. LDA automatically determines some optimal combination of variables so that the first function provides the most overall discrimination between groups, the second provides the second most, and so on. Moreover, the functions will be independent or orthogonal; that is, their contributions to the discrimination between groups will not overlap. The first function picks up the most variation; the second function picks up the greatest part of the unexplained variation, and so on. Computationally, a canonical correlation analysis is performed that will determine the successive functions and canonical roots.
2). Classification.
Classification is then made from the canonical functions. Subjects are classified into the groups in which they had the highest classification scores. The maximum number of discriminant functions will be equal to the degrees of freedom or the number of variables in the analysis, whichever is smaller.
One of the main criticisms of linear discriminant analysis as a credit scoring method involves the assumptions of distributional form (Eisenbeis, 1978). Firstly, the assumptions require that the covariance matrices of the predictor variables are equal for the two groups; furthermore, the predictor variables are required to follow a multivariate normal distribution. In credit scoring applications the predictor variables are often discrete or follow otherwise non-normal distributions. This clearly violates the second assumption. However, even if the normality assumption is violated, linear discriminant analysis is still widely applicable in separating groups, and the violation only affects the validity of significance tests (Hand & Henley, 1997).
When interpreting multiple discriminant functions, which arise from the analysis of more than two groups and more than one continuous variable, the different functions are first tested for statistical significance. If the functions are statistically significant, then the groups can be distinguished based on the predictor variables. Standardized β coefficients for each variable are determined for each significant function. The larger the standardized β coefficient, the larger is the respective variable's unique contribution to the discrimination specified by the respective discriminant function. In order to identify which independent variables help cause the discrimination between dependent variables, one can also examine the factor structure matrix with the correlations between the variables and the discriminant functions. Finally, the means for the significant discriminant functions are examined in order to determine between which groups the respective functions seem to discriminate.

Linear Regression
Linear regression models will be used to formulate a credit scoring model. Assume a linear model in which the probability $p$ that an applicant is bad is related linearly to $k$ explanatory variables,

$$p = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k = \beta^{\top} x,$$

where $\beta$ is the vector of parameters $(\beta_1, \beta_2, \dots, \beta_k)$.

The Buckley James Method
The Buckley James method will be used to correct any bias present in linear regression with censored data by replacing censored points with their expected values,

$$Y_i^{*}(b) = \delta_i Y_i + (1 - \delta_i)\, E_b\left[\, Y \mid Y > t_i \,\right],$$

where $b$ is the arbitrary slope to be estimated by the algorithm, $Y$ is the survival random variable, $\delta$ is the censoring indicator, $Y_i^{*}$ is the response for the $i$th observation and $t_i$ is the censoring time for the $i$th observation.
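To make the idea of replacing censored points with their expected values concrete, the following is a minimal sketch of a Buckley James iteration for a single explanatory variable. It is not the SPSS procedure used in the study. It assumes right censoring, a Kaplan-Meier estimate of the residual distribution with the largest residual treated as uncensored, no tied residuals, and synthetic data; the function name `buckley_james` and all simulation settings are hypothetical.

```python
import numpy as np

def buckley_james(x, y, delta, max_iter=50, tol=1e-6):
    """Fit y = a + b*x with right-censored y (delta = 1 if the event was observed)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    delta = np.asarray(delta, int)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]    # start from the naive fit
    for _ in range(max_iter):
        e = y - b * x                              # residuals (intercept included)
        order = np.argsort(e, kind="stable")
        e_s, d_s = e[order], delta[order].copy()
        d_s[-1] = 1                                # largest residual treated as an event
        # Kaplan-Meier survival of the residuals and the mass at each event.
        surv = np.cumprod(1.0 - d_s / (n - np.arange(n)))
        mass = np.concatenate(([1.0], surv[:-1])) - surv
        # Mass and weighted sum strictly to the right of each sorted residual.
        tail_mass = np.cumsum(mass[::-1])[::-1] - mass
        tail_sum = np.cumsum((mass * e_s)[::-1])[::-1] - mass * e_s
        # Censored residuals are replaced by their conditional expectation.
        e_star_s = e_s.copy()
        cens = (d_s == 0) & (tail_mass > 0)
        e_star_s[cens] = tail_sum[cens] / tail_mass[cens]
        e_star = np.empty(n)
        e_star[order] = e_star_s
        y_star = b * x + e_star                    # imputed responses
        a_new, b_new = np.linalg.lstsq(X, y_star, rcond=None)[0]
        converged = abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

# Synthetic example: true intercept 1 and slope 2, heavy random censoring.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 4.0, 500)
y_true = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 500)
c = rng.uniform(0.0, 12.0, 500)                    # censoring times
y_obs = np.minimum(y_true, c)
delta = (y_true <= c).astype(int)

X = np.column_stack([np.ones(500), x])
b_naive = np.linalg.lstsq(X, y_obs, rcond=None)[0][1]
a_bj, b_bj = buckley_james(x, y_obs, delta)
print(f"naive slope {b_naive:.2f}, Buckley James slope {b_bj:.2f}, true slope 2.00")
```

The iteration is capped at `max_iter` because the Buckley James update is not guaranteed to settle on a single value and may cycle between a few nearby estimates.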

Simulation
Monte Carlo simulation is used to compare linear and Buckley James regression and then to select the best model for calculating the risk involved in loan lending. Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial, project management, cost, and other forecasting models. Thus, in our case, simulation is used to compare all the regression methods used in this study.
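The kind of comparison described here can be sketched as a loop over simulated data sets. The example below is an illustration under invented settings, not the simulation design of the study: it only measures how far ordinary linear regression drifts from a known slope when censored times are treated as if they were observed events. The names `ols_slope` and `simulate_once`, the true coefficients and the censoring distribution are all hypothetical.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of an ordinary least squares fit of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def simulate_once(rng, n=200, intercept=1.0, slope=2.0):
    """One replication: slope from complete data and slope from censored data."""
    x = rng.uniform(0.0, 4.0, n)
    y_true = intercept + slope * x + rng.normal(0.0, 1.0, n)
    c = rng.uniform(0.0, 12.0, n)        # random censoring times
    y_obs = np.minimum(y_true, c)        # what the bank would actually observe
    return ols_slope(x, y_true), ols_slope(x, y_obs)

rng = np.random.default_rng(2012)
results = np.array([simulate_once(rng) for _ in range(500)])
mean_full, mean_censored = results.mean(axis=0)
print(f"mean slope, complete data: {mean_full:.3f}")
print(f"mean slope, censored data: {mean_censored:.3f}  (true slope 2.0)")
```

Any other estimator, such as a Buckley James fit, can be added as a further column of `results` and judged by the same criterion, for example mean bias or mean squared error relative to the true slope.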

Data
Key characteristics of debtors and debts include residential status, employment status, marital status, time at address, time in occupation, time at the bank, loan purpose, sex and age. Monthly performance data for each loan were recorded from the time the loan was opened until January 2012. The monthly performance data for each loan recorded whether the loan was still under way, whether the loan was more than 30 days in arrears, or whether the loan had been fully repaid. However, the data covered only 200 customers, who had either defaulted on or repaid their loans.
The monthly performance data were supplied as a Structured Query Language (SQL) dataset, SQL being the data analysis package used by CBZ. Rejected applicants were not included in the data because no reject inference was to be carried out. For each month, a loan could be Good (G), Bad (B), or closed (blank value). Good refers to a loan that was not 30 days behind in repayments. Bad refers to a loan that, at any time prior to that month, had been more than 30 days behind in repayments.
Take, for example, Loan 1, which is in the first data row of Table 1. It was opened in January 2010, as that is the position of the first G, and was closed (repaid in full) in April 2010. Hence the survival time, z, for this loan was 3 months and, because repayment was the observed mode of failure, δ_r = 1. The loan was not seen to default; accordingly, δ_d = 0.
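The conversion from monthly status codes to a survival time and a pair of censoring indicators can be sketched as follows. This is an illustration only: the function name `code_loan` and the list layout are hypothetical, each list is assumed to start at the month the loan was opened, and time is assumed to be measured as the number of months elapsed before the event.

```python
from typing import List, Tuple

def code_loan(statuses: List[str]) -> Tuple[int, int, int]:
    """Return (survival_time, delta_d, delta_r) for one loan.

    statuses holds one entry per month from the opening month onward:
    "G" (good), "B" (bad) or "" (closed, that is, repaid in full).
    """
    for month, status in enumerate(statuses):
        if status == "B":      # default is the observed mode of failure
            return month, 1, 0
        if status == "":       # repayment is the observed mode of failure
            return month, 0, 1
    # Loan still under way when observation ends: censored for both events.
    return len(statuses), 0, 0

# Loan 1: opened January 2010, closed (repaid in full) April 2010.
loan_1 = ["G", "G", "G", ""]
print(code_loan(loan_1))       # (3, 0, 1)
```

A loan that turns bad is coded with δ_d = 1 and δ_r = 0, so the same record serves as an event when modelling default and as a censored observation when modelling repayment.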

Results
The SPSS package was used to analyse the data using linear regression, Buckley James regression and linear discriminant analysis. The results obtained are presented and discussed in the following subsections.

Scatter Plots
Figure 3(a) shows a positive linear relationship between time to default and age. Age consists of two groups: the young (< 50 years) and the old (≥ 50 years). For older people, default is very high compared with the young. This is because the young have more working years ahead of them; they are ambitious, want more assets, are energetic and are more willing to work than older people. In the current economic situation in Zimbabwe, the pension given to retired workers is too low to cover major expenses, which puts older people at risk of default. In Figure 3(b), no relationship is seen between time to default and sex. This is because both males and females have an equal opportunity of getting the same income. Women also have their own businesses because of women empowerment programmes that allow women to work and even run families; thus anyone may default, whether male or female. Figure 3(c) shows a relationship between time to default and marital status. Those who were single (1) were at greater risk of defaulting than the married (2). For the married, the spouse could help with the finances, unlike the single, who face paying back on their own.
Loan purpose was categorised into the following groups:
1). High risk purpose loans, that is, loans for starting a business.
2). Medium risk purpose loans, that is, loans for buying a car.
3). Low risk purpose loans, that is, loans for paying school fees.
For high risk loans, default occurs within the first months. This could be because, although the business plans have not failed as such, taking off is hard and income starts flowing in later; the customer therefore has zero income for that month and the ones that follow, for up to 9 months. For medium risk loans, default starts from 7 months to 16 months. For low risk loans, default starts from a year upwards. Low risk loans have the most clients because they are easy to pay back, and most customers with low income jobs can afford to repay the loan on a monthly basis. In Figure 3(d), there is no relationship between time to default and time with bank, meaning that default is not influenced by the time a customer has been with the bank. Whether a customer has been with the bank for 2 years or more does not mean that they will not default, and the same applies to customers with less time with the bank. Default time is not predictable from time with bank. In Figure 3(e), there is a positive linear relationship between time to default and time at current job; the two are positively correlated. Time at current job shows how stable a customer is; thus, the more stable one is, the lower the chances of defaulting. Customers tend to default mostly through salary diversion, but if they have proof of where they are working, the employer can help clear the customer's credit by paying the customer's salary directly into the bank.
Figure 4(a) shows a linear relationship between time to repayment and age. For older people repayment is very low compared with the young, which can be related to Figure 3(a). According to Figure 4(b), no relationship is seen between time to repayment and sex. Figure 4(c) shows a relationship between time to repayment and marital status. Those who were married (2) took longer to repay than the single (1). Figure 4(d) shows that repayment was feasible in all the loan purpose categories. Figure 4(e) shows no relationship between time to repayment and time with bank, meaning that repayment is not influenced by the time a customer has been with the bank. Figure 4(f) shows a positive correlation between time to repayment and time at current job. Time at current job shows how stable a customer is; thus, the more stable one is, the higher the chances of early repayment. The following models were obtained after removing the insignificant variables, that is, sex, time with the bank and loan purpose.
An eigenvalue indicates the proportion of variation explained, that is, the between-groups sum of squares divided by the within-groups sum of squares. The larger the eigenvalue (a value ≥ 1.1), the stronger the function and its discriminatory power. Therefore, from Table 7, we have a strong function, since our eigenvalue is 1.177. The canonical correlation is the correlation between the discriminant explanatory variables and the levels of the dependent variable. A high correlation indicates a function that discriminates well. The present correlation of 0.888 between the dependent variables, that is, time to default and time to repayment, and the significant explanatory variables is extremely high. Wilks' Lambda is the ratio of the within-groups sum of squares to the total sum of squares. This is the proportion of the total variance in the discriminant variables not explained by the differences among groups. From Table 8, Lambda is 0.85, which is close to 1; this shows that the group means are almost equal and that the variance is explained mainly by factors other than the difference between those means. Here Lambda has a significant value; thus, the group means appear to differ. 67.5% of the original cases were correctly classified.
In Table 9, the classification result is a simple summary of the number and percentage of subjects classified correctly and incorrectly. For our data, 67.5% of the original grouped cases were correctly classified, meaning that 32.5% of customers were misclassified. High losses are incurred, since the cost of misclassification is the same for both groups, and this can lead to bank failure.
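The percentages in a classification table follow directly from its cell counts. The sketch below shows the arithmetic; the cell counts are hypothetical, chosen only so that 135 of 200 cases are correct, and are not the counts from Table 9. The function name `classification_summary` is likewise an assumption made for illustration.

```python
def classification_summary(table):
    """table[actual][predicted] holds case counts; returns (hit rate, miss rate)."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(table[group][group] for group in table)
    hit_rate = correct / total
    return hit_rate, 1.0 - hit_rate

# Hypothetical counts (NOT taken from Table 9): 70 + 65 = 135 of 200 correct.
table = {
    "good": {"good": 70, "bad": 30},
    "bad": {"good": 35, "bad": 65},
}
hit_rate, miss_rate = classification_summary(table)
print(f"{hit_rate:.1%} correctly classified, {miss_rate:.1%} misclassified")
```

Reporting the two off-diagonal cells separately, rather than only the overall miss rate, shows whether the errors are mainly bad customers accepted or good customers rejected.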

Buckley James Results
Buckley James estimation was carried out in simple linear regression applied to the unsecured personal loans data. The number of limiting values of the Buckley James estimates exhibits chaotic behaviour. Table 10 shows a summary of the coefficient values after the bias had been removed from the linear regression models. In the time to repayment model using Buckley James, time at current job was insignificant at the 5% level of significance, showing that there is poor score assignment for this variable.
From (12), loan purpose is negatively correlated with credit score; that is, low risk loans reduce the credit score more than high risk loans. For repayment, on the other hand, low risk loans increase the score more than high risk loans. Age, marital status and loan purpose were found to be crucial in both the defaulting and the repayment models, and time at current job was found to be a significant variable in the defaulting model.
Given the observed linear relationships, CBZ should consider reassigning the scores given in the credit score sheet to the explanatory variables that showed linearity, as these variables influence time to default or repayment.
To correct the bias present in the linear regression model, the Buckley James method was used, and it also performed well in simulation. Survival analysis was applied to the personal loans to estimate the time to default or to early repayment.

Recommendations
The Commercial Bank of Zimbabwe should observe the loan performance of each customer and act as soon as a loan goes bad. It is suggested that the bank establish a credit risk management team responsible for the following actions, which will help in minimising credit risk:
• Reconstructing the credit score sheet and reassigning scores to all the variables that affect defaulting and repayment.
• Implementing the Buckley James method, as it proved to be better performing.
• Reconsidering the minimum age for a loan applicant, as the study showed that 21 years is not valid for loan application.
• Reviewing the customers that fall under single and married in the credit score sheet as there are widows and widowers.
• Closely monitoring the loan performance of each customer taking survival analysis into consideration as well.

Table 1. Loan performance

Table 2 shows the results from converting the examples in Table 1 into survival times. This conversion is needed because most survival regression programs require the data to be expressed as a combination of survival times and censoring indicators. A loan is generally classified as bad if it is in default at any stage in the 12 months after opening. If the loan is fully paid off, or is still under way at the 12 month cut-off, the customer is classified as good.

Table 2. Survival times of the loans

Table 7. Summary of canonical discriminant functions

Table 9. Classification table for loans

From Table 10, sex and time with the bank have sig. values > 0.10, indicating that they do not contribute to the discriminant model; otherwise all values are significant. Table 10 also suggests age as the best variable, followed by time at current job, marital status and loan purpose. It was found that older people and single people have a high risk of default. According to Buckley James defaulting model (