Creating sub datasets for each country - Case: identifying and predicting borrower default

6 Case: identifying and predicting borrower default and comparing results between

6.2.5 Creating sub datasets for each country

The dataset is split by country so that each country forms its own sub-dataset. There are now three sub-datasets: Estonian with 6470 samples, Finnish with 2524 samples, and Spanish with 3839 samples. The original dataset is also kept intact for comparison purposes. The country variables were removed from the datasets since they have served their purpose, which results in a total of 62 variables.
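The split described above can be illustrated with a minimal sketch. It is written in Python rather than the MATLAB used in the thesis, and the single `Country` field, the country codes, and the example records are hypothetical placeholders, not the actual Bondora variables.

```python
from collections import defaultdict

def split_by_country(records, country_key="Country"):
    """Group records by country and drop the country field from each copy."""
    subsets = defaultdict(list)
    for record in records:
        row = dict(record)              # copy, so the original dataset stays intact
        country = row.pop(country_key)  # the country field has served its purpose
        subsets[country].append(row)
    return dict(subsets)

# Hypothetical toy records; field names and values are assumptions.
loans = [
    {"Country": "EE", "Amount": 1000, "Default": 0},
    {"Country": "FI", "Amount": 2500, "Default": 1},
    {"Country": "ES", "Amount": 800, "Default": 1},
    {"Country": "EE", "Amount": 3000, "Default": 1},
]
by_country = split_by_country(loans)
```

Copying each record before removing the country field mirrors the choice to keep the original dataset available for comparison.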

Descriptive statistics

The final dataset consists of 62 variables, but most of them are dummy variables. Without them the number of variables would be 28, but many of the categorical variables need to be in dummy form to be analysed properly. The target variable, default, is examined in Table 4. As one can see, default percentages vary a lot between countries. This is exactly what this thesis is about: recognizing the reasons why these differences are so large. Estonia has the most borrowers in the data. Since the differences in sample sizes are significant, balancing should be included. This will be done using the random under-sampling (RUS) method. Sample sizes for each country should be large enough for evaluation. It is also noteworthy that default rates seem to be very high. One reason may be that many of those who borrow through P2P lending have been declined by traditional banks. So, naturally, the default rate should be higher in P2P lending since borrowers tend to carry more risk overall. Furthermore, P2P lending sites probably do not screen borrowers as efficiently as banks, which results in the acceptance of riskier borrowers.

Table 4. Class frequencies of target variable between countries

(Columns: Default, Count, Percent)

Appendix 1 contains all the variables used in this research. The table gives an overall impression of the variables used and of the categories they relate to most. The variable groups are borrower’s demographics, borrower’s financials, loan characteristics, and borrower’s credit history. It is also worth noting that 13 out of 28 variables are categorical, which is a relatively high ratio.

The descriptive statistics of numerical variables are in Appendix 2. The average age of the borrower is 37.74, and loan applicants usually do not have any dependants since the median value is 0. The average loan amount is 2354.30 €, so small loans are usually favoured. The average loan duration is 35.63 months, which suggests that loans of about three years are typical. The financials of the borrowers seem to be quite weak. The average total income is 1368.47 €, which is quite low. Total liabilities are 793.74 € on average, and these are paid out of the total income. This leads to an average free cash of 503.49 €. Some borrowers have almost no free cash, since the minimum value is 0.25 €. This is also reflected in the interest rate, which is 38.91 % on average. The maximum accepted interest rate is 254.84 %, which is extremely high. Debt-to-income is at an acceptable level, though, with a mean of 27.73 %. All of this leads to extremely risky borrower behaviour since there is not much free cash available for loan payments. This goes a long way towards explaining why the default rates are so high.

The credit history of the borrowers is at an acceptable level. Many of the borrowers do not even have previous loans, as the mean is 0.42. The median values of both the number and the amount of previous loans are 0. This means that many of Bondora’s borrowers are borrowing for the first time. Even though it is good that many borrowers do not have previous loans, this might also mean that they are inexperienced with borrowing. It may come as a surprise to such borrowers that interest payments are quite high, and the result is default. The average monthly payment is 127.83 €, but the maximum value is 1377.76 €.

Most of the continuous variables used are skewed or leptokurtic. This can be seen from the skewness and kurtosis figures in Appendix 2 and from the histograms in Appendix 3.

Most of the financial variables are strongly skewed to the right. This can affect machine learning models that assume normally distributed data. In this case, however, SVM, LR and RF are used, and none of them requires normally distributed data, so skewness will not be an issue.
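As a reminder of what the skewness and kurtosis figures measure, the following is a minimal Python sketch of the moment-based definitions. It is an illustration only, not the computation behind Appendix 2, and the example values are made up. Kurtosis is defined here so that a normal distribution has a value of 3, which means values above 3 indicate a leptokurtic variable.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) from central moments; normal kurtosis = 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Made-up examples: a symmetric sample and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 50]

skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_kurtosis(right_skewed)
```

A single large value, like the maximum monthly payment compared with its mean, is enough to push both measures up sharply.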

Most of the categorical variables are already coded as dummy variables, and it would be too much to include all dummies in the descriptive statistics and visualizations. So, to analyse descriptive statistics, the data before dummy coding is used. This means that outliers have not been removed and some of the high-cardinality categories are still present. These categories are identified in the process. Appendix 4 contains all categorical variables, their frequencies, and percentages. Some categories contained just a few samples or none at all, so these were removed.

These categories were EmploymentStatus: Unemployed with 13 samples, EmploymentDurationCurrentEmployer: Other and Retiree with 0 samples, and HomeOwnershipType: Homeless with 1 sample. Also, Country: Slovakia with 218 samples was removed from the final data since it does not have enough samples to form its own dataset like the other countries.

Class percentages indicate that 75.80 % are new customers in Bondora. Also, for a very large portion of borrowers (34.63 %) income has not been verified at all, so screening of borrowers seems to be a big issue in P2P lending. The majority of the borrowers are from Estonia. Home improvement is the most popular reason to take a loan with 25 %. The most common education level is secondary education with 37.81 %. A large majority of borrowers are fully employed (81.19 %), and the most common employment duration is more than 5 years with the current employer (37.24 %). Work experience in years seems to be rather evenly distributed, except that only 5.09 % of the samples have less than 2 years of experience. The most common home ownership type is owner with 30.18 %. The HR (high risk) rating provided by Bondora is the most common rating with 33.36 %, which further demonstrates the high-risk nature of P2P lending.

Balancing of the data using RUS

As we can see from Table 4, countries have different ratios of defaulters, which can make performance metrics misleading. Therefore, random under-sampling was performed.

This means that samples in the majority class are removed so that the default and non-default classes are the same size. This results in a loss of information, but it is a better alternative than inaccurate models. There was no ready-made function for random under-sampling, so MATLAB was instructed to find the majority-class samples and remove them at random until the majority class was the same size as the minority class. This resulted in dataset sizes of 4736 samples for Estonian, 1808 samples for Spanish and 1650 samples for Finnish borrowers. All the datasets now have a 50/50 ratio of defaulters. The whole dataset was under-sampled in the same way.
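The under-sampling procedure described above can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the thesis; the function name, the seed, and the toy labels are assumptions made for the example.

```python
import random

def random_under_sample(labels, seed=0):
    """Return sorted indices that keep every class at the minority class size."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)
    minority_size = min(len(idx) for idx in indices_by_class.values())
    kept = []
    for idx in indices_by_class.values():
        # The minority class is kept whole; larger classes are randomly reduced.
        kept.extend(rng.sample(idx, minority_size))
    return sorted(kept)

# Toy labels (assumed): 30 defaulters (1) and 70 non-defaulters (0).
labels = [1] * 30 + [0] * 70
kept_idx = random_under_sample(labels)
balanced = [labels[i] for i in kept_idx]
```

Because each class is cut to the minority size, the balanced sample always contains twice the minority count, which is consistent with the even sample sizes reported above.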

Feature selection: Chi square

For feature selection, the chi-square method is used. With feature selection, most of the unnecessary variables can be filtered out of the model to increase performance. This results in fewer variables but a more efficient model. The use of the chosen method is justified in chapter 5.1. Chi-square was calculated using MATLAB’s function fscchi2(). It constructs a feature ranking based on the given data and the target variable. The score given by the function is the negative logarithm of the p-value of the chi-square test, so a larger score indicates a more important predictor. Sometimes this value is infinite, which happens when the p-value is too small to be represented.
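To make the score concrete, here is a minimal Python sketch of a chi-square ranking for binary (dummy) predictors against a binary target. It is not MATLAB's implementation: it assumes 0/1 features only (continuous predictors would first need to be binned), and the feature names and data are invented for the example. For a 2x2 table the test has one degree of freedom, so the p-value can be computed with the standard library as erfc(sqrt(chi2 / 2)).

```python
import math

def chi2_score(feature, target):
    """Return -log(p) of a chi-square independence test on a 2x2 table."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]  # counts indexed by [feature value][target value]
    for f, t in zip(feature, target):
        observed[f][t] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper-tail probability of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return math.inf if p_value == 0.0 else -math.log(p_value)

# Invented data: one predictor identical to the target, one unrelated to it.
target = [0, 1] * 50
features = {
    "informative": list(target),
    "unrelated": [0, 0, 1, 1] * 25,
}
scores = {name: chi2_score(values, target) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# With enough samples the p-value underflows to zero and the score is infinite.
big_target = [0, 1] * 1000
infinite_score = chi2_score(list(big_target), big_target)
```

The last two lines show one way an infinite score can arise: the association is so strong that the p-value cannot be distinguished from zero in floating point.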

The results of chi-square can be seen in Figure 14 for each country and the whole data. The scores are plotted in descending order. As we can see from the graph, the scores vary a lot between countries. For example, Estonia has only two very important variables while Spain and Finland have multiple. Also, the whole dataset has one variable with an infinite score, which is shown in purple. Variable importance decreases relatively fast for each dataset, which means that there are only a few important variables that contribute to default. 30 variables seems to be the sweet spot for each dataset since the score tends to be very low after 30 variables.

Figure 14. Visualization of Chi-square feature selection scores

Table 5 presents the ten most important variables for each country. The same variables appear in the rankings, but in a different order. The two most important variables seem to be the rating provided by Bondora and the monthly payment. Interestingly, Spain does not have rating in its top three; it is in seventh place. Other recurring variables are credit history variables such as the number and amount of previous loans and new credit customer. Furthermore, loan characteristics are important too, since loan duration and loan amount keep occurring in all datasets. It is curious that interest rate is important in the whole dataset, but when the data is divided by country it does not even reach the top ten. What is more, existing and/or total liabilities occur frequently in each dataset. Of the borrower characteristics, gender and education occur frequently. Only Finland does not have either of these two in its top ten. Finland has one peculiar variable, and that is home ownership type: tenant, pre-furnished property. My guess is that this category contains risky demographic groups, such as students, who are already in a dire financial situation, so lending to them is very risky. Loan consolidation appears only in the Estonian top ten, which makes sense since it means that the borrower uses the loan to pay off other debts, which is risky behaviour. Interestingly, income was left out of the top ten.

Table 5. Ten most important predictors for each country

Since Figure 14 shows that the chi-square score is very low after 30 variables, this could be the number of variables to use in the models. There are multiple ways to determine the optimal number of variables for each model. But since the point of this research is to examine differences between countries, a fixed number of predictors makes sense. It means that all countries and models have the same starting point, and country differences can be compared in a more comprehensive manner.

Data split

The data split is done using the cvpartition function in MATLAB. This function allows the partitioning method to be specified. In this case the holdout method with a 70-30 split was used: 70 % of the data is used for training and hyperparameter optimization and 30 % is used for testing the models.

Also, the Stratify option was used to keep the class distribution at 50 % in both the training and the test data. K-fold cross-validation is used on the training data in the model training phase to ensure the validity of the results.
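The idea of a stratified 70-30 holdout split can be sketched in a few lines of Python. This is an illustration of the principle only, not a reproduction of cvpartition; the function name, seed, and toy labels are assumptions.

```python
import random

def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split indices into train and test sets with the same class proportions."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)
    train, test = [], []
    for idx in indices_by_class.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)  # per-class test size
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)

# Toy balanced labels (assumed): 50 non-defaulters and 50 defaulters.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_holdout(labels)
```

Splitting each class separately is what keeps the 50/50 class ratio intact in both parts; a plain random split would only preserve it approximately.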

| Rank | Whole data | Estonia | Spain | Finland |
|---|---|---|---|---|
| 1 | Rating | Rating | Monthly payment | Monthly payment |
| 2 | Monthly payment | Monthly payment | Number of previous loans | Rating |
| 3 | Number of previous loans | Loan duration | New credit customer | Amount of previous loans before loan |
| 4 | New credit customer | Refinance liabilities | Amount of previous loans before loan | Number of previous loans |
| 5 | Amount of previous loans before loan | New credit customer | Loan duration | New credit customer |
| 6 | Interest | Number of previous loans | Liabilities total | Refinance liabilities |
| 7 | Gender | Education | Rating | Existing liabilities |
| 8 | Applied amount | Amount of previous loans before loan | Education | Amount |
| 9 | Existing liabilities | Use of loan: Loan consolidation | Amount | HOT: Tenant pre-furnished property |
| 10 | Amount | Existing liabilities | Gender | Loan duration |

Hyperparameter optimization and model training

Now that the data is pre-processed and the features are selected, the hyperparameters of the models can be optimized and the final model specifications can be set. This chapter describes the hyperparameter optimization and training process.

In document Default prediction in peer-to-peer lending and country comparison (pages 72-78)