
6 Case: identifying and predicting borrower default and comparing results between

6.7.3 Random Forests

The random forests model was trained using MATLAB’s fitcensemble function. This function uses random forests by default, so the method did not need to be specified beforehand. The method has many parameters that can be optimized. The selected ones are the number of learning cycles, the number of variables to sample, the minimum leaf size, the maximum number of splits, and the split criterion.

Appendix 9 contains the in-sample and 10-fold CV misclassification rates for the optimized and non-optimized models. The in-sample result of the non-optimized model perfectly captures the training data in the Spanish and Finnish datasets and almost perfectly in the Estonian data. This is concerning in terms of overfitting, but the 10-fold CV model performs more realistically. Hyperparameter optimization also provided better results than the non-optimized 10-fold CV model. It improved the predictions of the whole data, Estonian, and Spanish models by 2 %, and the largest change was in the Finnish dataset, which improved by approximately 6 %. Hyperparameter optimization therefore seems to improve RF performance considerably, and the 10-fold CV parameter-optimized model is used in the evaluation phase. Appendix 10 contains the optimized hyperparameters for each country.
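The gap between in-sample and cross-validated misclassification can be illustrated with a small sketch. It is written in Python rather than MATLAB and uses a k-nearest-neighbour classifier on synthetic data as a stand-in for the random forests model. The data, the function names and the candidate values of k are all assumptions made for illustration, not the setup used in this study. With k = 1 the classifier memorizes the training data, so its in-sample error is zero, while the 10-fold CV error gives a more realistic estimate and can be used to choose between hyperparameter values.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training rows nearest to point x."""
    nearest = sorted(
        train,
        key=lambda row: (row[0] - x[0]) ** 2 + (row[1] - x[1]) ** 2,
    )[:k]
    votes = sum(label for _, _, label in nearest)
    return 1 if 2 * votes > k else 0  # k is odd, so there are no ties


def misclassification(train, test, k):
    """Share of test rows whose predicted label differs from the true label."""
    wrong = sum(knn_predict(train, (a, b), k) != label for a, b, label in test)
    return wrong / len(test)


def cv_misclassification(data, k, folds=10):
    """Average misclassification over `folds` held-out partitions."""
    errors = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th row is held out
        train = [row for j, row in enumerate(data) if j % folds != i]
        errors.append(misclassification(train, test, k))
    return sum(errors) / folds


# Synthetic, noisy two-feature data (hypothetical, not the loan data).
rng = random.Random(0)
data = []
for _ in range(300):
    a, b = rng.random(), rng.random()
    label = 1 if a + b + rng.gauss(0, 0.3) > 1 else 0
    data.append((a, b, label))

# In-sample error of the memorizing model (k = 1) versus CV error per k.
in_sample = misclassification(data, data, k=1)
cv_by_k = {k: cv_misclassification(data, k) for k in (1, 5, 15)}
best_k = min(cv_by_k, key=cv_by_k.get)
print(in_sample, cv_by_k, best_k)
```

Selecting the value of k with the lowest CV error is a simplified analogue of the hyperparameter optimization described above; it does not reproduce the optimizer used by MATLAB.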

Evaluation of the models and countries predictions

Now that the models are trained, the actual predictions on the test datasets can begin.

Prediction results are calculated from the confusion matrix, and all numbers are derived from it. All evaluation metrics used are introduced in chapters 5.5.1 and 5.5.2. Good measures to keep an eye on are accuracy, sensitivity, type 2 error, and AUC. Accuracy is a good overall measure that indicates model performance at a glance. Sensitivity is important since it considers the true positives of the confusion matrix, that is, correctly predicted defaults. Type 2 error is also important since it is the costly mistake of the prediction model: samples predicted as good when they actually default. AUC is the most important performance measure since it considers both true positives and false positives.
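As a minimal sketch of how these measures follow from the confusion matrix, the Python function below derives them from the four cell counts. The definitions are the standard ones and are assumed to match chapters 5.5.1 and 5.5.2, with default treated as the positive class. The function name and the counts in the example are hypothetical.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Rates derived from a 2x2 confusion matrix where 'positive' means default."""
    sensitivity = tp / (tp + fn)  # share of actual defaults that were caught
    specificity = tn / (tn + fp)  # share of good loans recognized as good
    precision = tp / (tp + fp)    # share of predicted defaults that defaulted
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "misclassification": (fp + fn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "type_1_error": fp / (fp + tn),  # good loans flagged as defaults
        "type_2_error": fn / (fn + tp),  # defaults predicted as good loans
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical counts for illustration only.
m = confusion_metrics(tp=80, fn=20, fp=30, tn=70)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions type 1 error equals 1 - specificity and type 2 error equals 1 - sensitivity, so a model with a low type 2 error necessarily has a high sensitivity.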

Table 6 contains the performance measures of logistic regression for each country. All these measures seem to decrease as the dataset is broken down into countries with fewer samples. Estonia has the most samples and Finland the fewest. It seems that the country with the most samples gets the best prediction results, which makes sense since supervised prediction models improve when they are fed more samples. But even though the differences in default rates between countries were large in the original data (Table 4), these models show that default can be predicted to a somewhat similar level. Finland and Estonia have the largest difference in prediction performance, but the Spanish model seems to be relatively close to the Estonian level.

The whole dataset predicts the best overall, though. Interestingly, Estonia has the smallest type 2 error and the highest sensitivity. This means that the Estonian model has predicted defaults the best and has the smallest error in wrongly assigning the non-default class to an actual defaulter.

Table 6. Evaluation metrics of logistic regression

Table 7 contains the evaluation metrics of SVM for each country and the whole dataset. The same pattern appears here: the dataset with the most samples has the best overall prediction performance in terms of accuracy and AUC. SVM seems to be a lot better when it is run on the whole dataset. Dividing the data into countries reduces its prediction capabilities significantly.

Table 7. Evaluation metrics of SVM

Table 8 contains the evaluation metrics of the random forests model. Accuracy and AUC follow the same pattern as previously: the whole dataset gets the best prediction performance, and performance decreases as the dataset changes to a country with fewer samples. Interestingly, the models sustain a higher level of prediction performance when RF is used. Furthermore, sensitivity is highest in the Finnish dataset, which is intriguing, since it has the lowest sample size and had the highest default rate before data sampling. Type 2 error is also lowest in the Finnish dataset. This means that the RF model seems to predict the default class best on the Finnish dataset.

This is an interesting finding since the assumption is that a bigger dataset would fare better in predicting every class. More importantly, it is precisely default that we want to avoid, and restricting the data to regions seems to help with it, at least in this model’s case. This gives a small indication that modelling countries separately might be beneficial. A bigger sample size might make the differences even clearer.

Table 8. Evaluation metrics of random forests

Now that each classifier has been evaluated independently, all models are compared to decide which one performs the best. Table 9 combines the three previous tables, so it is easier to spot performance differences. Overall, looking at all values, the random forests model is the best one. It has the best values in many evaluation metrics and in each country dataset. All metrics also seem to be more stable between countries when using random forests, whereas the other models vary a bit more. Interestingly, the logistic regression model for the Estonian dataset seems to beat the random forests model for the Estonian dataset completely. The best value for each evaluation metric is bolded so that differences are easier to spot. In terms of AUC and F-measure, RF performs the best. These were the best “single number” metrics for determining the best prediction model. So, of these three methods, RF is the best both in stability between country datasets and in the metric values themselves.
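Because AUC carries much of the weight in this comparison, a short sketch of what it measures may help. The Python function below computes AUC through its rank interpretation: the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter, with ties counted as one half. This is a generic illustration with hypothetical scores and labels, not the routine used to produce the tables in this study.

```python
def auc(scores, labels):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # defaulter scored above non-defaulter
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted default probabilities and true labels (1 = default).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why AUC does not depend on the classification threshold in the way accuracy or sensitivity do.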

Table 9. Evaluation metrics for all models

Determinants of default

It is important to recognize the variables that have the most effect on default probability, but it is also necessary to know how they affect it. For this reason, logistic regression was run on all datasets using the glm function in MATLAB, which provides the estimate, standard error, t-statistic, and statistical significance (p value) for each variable. The function was run on the sampled datasets after feature selection, so these datasets contain the 30 most important variables. The results can be found in appendices 11-13.
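The relationship between the reported quantities can be sketched as follows. The test statistic is the estimate divided by its standard error, and a two-sided p value follows from a reference distribution. The Python sketch below assumes a standard normal reference distribution, which is a common large-sample approximation and may differ slightly from what MATLAB reports. The function name and the input values are hypothetical.

```python
import math


def wald_test(estimate, std_error):
    """Return the test statistic and a two-sided p value (normal approximation)."""
    stat = estimate / std_error
    # For a standard normal Z, P(|Z| > |stat|) equals erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Hypothetical coefficient and standard error, for illustration only.
stat, p_value = wald_test(estimate=0.42, std_error=0.15)
print(f"statistic = {stat:.2f}, p value = {p_value:.4f}")
```

An estimate close to zero relative to its standard error gives a statistic near zero and a p value near one, which is the pattern that marks a variable as far from statistically significant.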

Only the estimates and p values of the 10 most important variables, chosen by feature selection, are shown in Table 10. Bolded variables were selected into the top 10 in every dataset. There are six such recurring variables: Rating, Monthly payment, No. of previous loans, New credit customer, Amount of previous loans before loan, and Existing liabilities. As expected, a worse rating increases the probability of default considerably, since its estimates are all positive and relatively high compared to the other estimates. The Existing liabilities variable also makes sense: the more liabilities a borrower has, the less money is left for interest payments, so default probability increases.
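To make the reading of the estimates concrete, the sketch below shows how the sign of a logistic regression coefficient translates into a change in predicted default probability. A positive estimate multiplies the odds of default by more than one for each unit increase in the variable, and a negative estimate by less than one. The intercept, coefficients and borrower values are hypothetical and are not taken from Table 10.

```python
import math


def default_probability(intercept, coefs, x):
    """Logistic model: P(default) = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(coef):
    """Multiplicative change in the odds of default per one-unit increase."""
    return math.exp(coef)


# Hypothetical model with two variables: a rating score (positive estimate)
# and a count of previous loans (negative estimate).
intercept = -1.0
coefs = [0.5, -0.2]

p_base = default_probability(intercept, coefs, [3, 2])
p_worse_rating = default_probability(intercept, coefs, [4, 2])
p_more_loans = default_probability(intercept, coefs, [3, 3])

print(p_base, p_worse_rating, p_more_loans)
print(odds_ratio(coefs[0]), odds_ratio(coefs[1]))
```

In this illustration a one-step worse rating raises the predicted probability, while one more previous loan lowers it, mirroring the direction of effect that the signs of the estimates indicate.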

                    Logistic regression (LR)               Support vector machine (SVM)           Random forests (RF)
Evaluation metrics  Whole data  Estonia  Spain    Finland  Whole data  Estonia  Spain    Finland  Whole data  Estonia  Spain    Finland
Accuracy            0.697       0.667    0.627    0.582    0.700       0.631    0.616    0.558    0.698       0.656    0.629    0.626
Type 1 error        0.264       0.332    0.339    0.441    0.272       0.356    0.387    0.466    0.281       0.337    0.373    0.437
Type 2 error        0.342       0.334    0.406    0.395    0.328       0.382    0.380    0.419    0.322       0.352    0.369    0.310
Misclassification   0.303       0.333    0.373    0.418    0.300       0.369    0.384    0.442    0.302       0.344    0.371    0.374
Sensitivity         0.658       0.666    0.594    0.605    0.672       0.618    0.620    0.581    0.678       0.648    0.631    0.690
Specificity         0.736       0.668    0.661    0.559    0.728       0.644    0.613    0.534    0.719       0.663    0.627    0.563
G-mean              0.696       0.667    0.626    0.581    0.700       0.631    0.616    0.557    0.698       0.656    0.629    0.623
Precision           0.713       0.667    0.636    0.579    0.712       0.634    0.615    0.556    0.707       0.658    0.629    0.613
F-measure           0.684       0.667    0.615    0.592    0.691       0.626    0.618    0.568    0.692       0.653    0.630    0.649
AUC                 0.761       0.728    0.663    0.606    0.758       0.689    0.634    0.597    0.781       0.723    0.700    0.666

Table 10. 10 most important variables for each dataset

The rest of the six main variables have interesting effects on default. The Monthly payment variable has a decreasing effect on the probability of default, which seems counterintuitive, since larger monthly payments should make it more difficult to cope with all liabilities. In this case, however, the opposite appears to hold. Interestingly, the amount and interest variables, which should increase the monthly payment, have the opposite effect: they increase the probability of default. This is contradictory, so there might be more to the Monthly payment variable than initially thought. Furthermore, chi-square feature selection failed for this variable in the Spanish dataset: it should not be the most important variable, since its estimate is almost zero and its p value is near one, which means it is far from statistically significant. Another very interesting variable is No. of previous loans before loan, since it has a negative effect on default probability: the more loans a borrower has had, the lower the chance of default. This could be explained by borrowing experience, as mentioned in the feature selection chapter. The more experience a borrower has with borrowing, the more likely they know how to manage all payments.

This is also backed up by another important variable, New credit customer, which has a positive estimate, although it is not statistically significant in the country datasets. This variable means that the borrower is new to Bondora, and the inexperience leads to a higher probability of default. Amount of previous loans before loan is intriguing as well. Since the estimate is positive, this variable has an increasing effect on default probability, which is somewhat odd since the number of loans had the opposite effect. But if the amount of previous loans is very high, risky borrowing behaviour may be much more probable.

Some variables that are not recurring also have interesting effects. Gender is selected as important in the whole data and the Spanish dataset. In both, female borrowers seem to have a higher probability of default, since female = 1 and male = 0 in this variable. In the Spanish data this is somewhat expected, since the culture is more patriarchal, meaning that men handle finances more often, so women might have less experience. In the Finnish and Estonian data, by contrast, the Gender variable is at the bottom of the 30 important variables, which indicates that these countries have more gender equality. In both of these datasets the Gender variable also has a negative estimate, meaning that a higher value decreases default risk, that is, women default less. This is a more common result in countries with more gender equality, since men in general show riskier behaviour than women. So, the Gender variable appears in the whole dataset’s top 10 because of its importance in the Spanish dataset.

Interestingly, the interest, applied amount, and amount variables, which all indicate riskier borrowing, are not statistically significant in any dataset. The Education variable also behaves logically: more education means lower default probability. Higher values of the Verification type variable indicate more thoroughly verified income. Its effect makes sense as well, since the estimate is negative: the more verified the income, the lower the default probability. Loan duration appears as an important variable in many datasets, and it also has a logical effect: since the estimate is positive, a longer loan duration leads to an increased probability of default.

The last variable is Use of loan: Loan consolidation, which only appears in the Estonian data. It has an interesting effect, since this variable means that the borrower finances debts with more debt. One might think that this is very risky borrower behaviour, but instead it has a negative estimate, meaning it decreases default probability.

The same variables tend to appear in the different countries’ datasets. Still, there are differences in the ordering of the most important variables, and some interesting variables appear in only one country’s dataset. This indicates that the countries do differ, and it might be beneficial to evaluate them separately to get better prediction results.
