Default Detection in P2P Lending
Author: Dehai Liu
Department of Mathematics, Sun Yat-Sen University
1. Abstract
This project focuses on data mining of P2P lending data. Using the LDA topic model, we extract information from the loan statements written by borrowers. Combined with the traditional features used in default detection (gender, age, target value, etc.), these topic features let us predict the probability of default with a random forest, which achieves high accuracy and is straightforward to interpret.
2. Data Description
Ren Ren Dai (https://www.renrendai.com/) is one of the largest online P2P lending platforms in China. Using the crawling software Octopus, I obtained around 10,000 online lending records with 20 features and 1 label (default or not default):
Lending Hard Information
Attributes | Definitions |
---|---|
interest rate | the interest rate of the loan |
target value | the amount of the loan |
lending period | the period of holding the loan |
Lending Soft Information
Attributes | Definitions |
---|---|
repayment method | 1: pay interest first 2: pay interest and principal together |
property loan | 1: short-term turnover 2: personal consumption …. |
loan statement | the statement for loan before application |
Lending Personal Info
Attributes | Definitions |
---|---|
age | continuous variable |
gender | 0: female, 1: male, 2: NULL |
education | 1: high school 2: junior college 3: undergrad … |
marriage | 1: divorced 2: married 3: single 4: widowed |
census register | province in China |
income | 1: <1000 2: 1000-2000 … |
house property | 1: yes 0: no |
house loan | 1: yes 0: no |
car | 1: yes 0: no |
car_loan | 1: yes 0: no |
type company | type of company |
industry | 1: IT 2: food … |
scale company | (number of employees) 1: <10 2: 10-100 … |
type job | type of job |
workplace | the location of job |
time job | 1: <1 year 2: 1-3 years … |
3. Data Preprocessing
(a) Load the data
- Load p2pData.csv
Note: irrelevant features have already been removed from the raw data.
(b) Outlier Detection
Perform outlier detection on the continuous variables target value and age with Tukey's method, i.e. treat a data point as an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
[Figure: distribution of target value, before and after removing the outliers]
[Figure: distribution of age, before and after removing the outliers]
The distributions of both features become less skewed after removing the outliers.
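The Tukey rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names and the sample amounts are hypothetical, and the quartiles use the "inclusive" method of the standard library, which may differ slightly from the quartile definition used originally.

```python
from statistics import quantiles

def tukey_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the Tukey fences."""
    lower, upper = tukey_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical loan amounts, not taken from the dataset.
amounts = [3000, 5000, 8000, 10000, 12000, 15000, 20000, 500000]
cleaned = remove_outliers(amounts)  # the 500000 entry falls outside the fences
```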
(c) Missing Values
- For ordinal variables, impute the missing values with the median.
- For categorical variables without order, drop the samples with missing values, since they make up only a small portion of the data.
(d) Scaling
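Scale the continuous variables to [0,1] using min-max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the variable.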
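Assuming the scaling meant here is standard min-max scaling, a small Python sketch looks as follows. The function name and the example ages are hypothetical.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages, not taken from the dataset.
ages = [22, 30, 38, 46]
scaled = min_max_scale(ages)
```

In practice the minimum and maximum would usually be computed on the training set and reused for the validation and test sets.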
(e) Split the Data Set and Balance
- Split the dataset into 3 parts: 70% for the training set, 15% for the validation set and 15% for the test set.
- The samples labeled as default account for only 10% of the whole dataset, so the data is imbalanced. I address this problem by using the SMOTE algorithm to generate a balanced dataset.
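The two steps above can be sketched as follows. This is a simplified illustration under several assumptions that the text does not state: the split is stratified by label, SMOTE is applied to the training set only, all features are numeric, and each synthetic point starts from a randomly chosen minority sample. All names and the toy data are hypothetical.

```python
import random

def stratified_split(rows, seed=0):
    """Split (features, label) rows 70/15/15 within each class."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted({y for _, y in rows}):
        group = [row for row in rows if row[1] == label]
        rng.shuffle(group)
        n_train = len(group) * 70 // 100
        n_val = len(group) * 15 // 100
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of base inside the minority class, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy data: 200 rows with two numeric features, 10% labeled as default (1).
data_rng = random.Random(42)
rows = [([data_rng.random(), data_rng.random()], int(i < 20)) for i in range(200)]

train, val, test = stratified_split(rows)
minority = [x for x, y in train if y == 1]
majority = [x for x, y in train if y == 0]
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced = train + [(x, 1) for x in new_points]
```

Oversampling only the training set keeps the validation and test sets at their original class ratio, so the reported metrics reflect the real default rate.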
4. Topic Model - LDA
(a) Feature Engineering
LDA stands for Latent Dirichlet Allocation. In LDA, the process of generating a document can be viewed as follows:
- From a Dirichlet distribution with parameter alpha, draw the topic distribution theta of document d
- From theta, draw the topic z for the word in position n
- From another Dirichlet distribution with parameter eta, draw the word distribution beta for topic z
- Draw the word w_dn from beta
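The generative process above can be simulated directly. The sketch below is only an illustration: the sizes, the values of alpha and eta, and all function names are assumptions, and words are represented by integer ids. A Dirichlet draw is obtained by normalizing independent gamma draws.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_document(n_words, alpha, beta, rng):
    """Generate one document: theta ~ Dir(alpha), then z and w for each position."""
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)    # topic of this position
        w = sample_categorical(beta[z], rng)  # word drawn from that topic
        words.append(w)
    return theta, words

rng = random.Random(0)
n_topics, vocab_size = 3, 6           # hypothetical sizes
alpha = [0.5] * n_topics              # hypothetical Dirichlet parameters
eta = [0.1] * vocab_size
beta = [sample_dirichlet(eta, rng) for _ in range(n_topics)]  # one word distribution per topic
theta, words = generate_document(8, alpha, beta, rng)
```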
Suppose we fix the number of topics at $K$. LDA then yields, for each loan statement $d$, a topic vector giving the probability that the statement is assigned to each topic: $\theta_d = (\theta_{d,1}, \theta_{d,2}, \ldots, \theta_{d,K})$.
This vector can be seen as a set of topic features and can be combined with the features in Section 2.
The pipeline for obtaining this vector is as follows:
- Split each sentence into words (using Ansj for Chinese text), then remove stopwords and meaningless noise
- Run Gibbs sampling until convergence
- Obtain the topic vector
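The sampling step of this pipeline can be sketched with a collapsed Gibbs sampler. This is a minimal sketch, not the implementation used in the project: it runs a fixed number of sweeps instead of testing for convergence, the hyperparameters are assumptions, and the English tokens stand in for statements that have already been segmented and cleaned.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic vector per document.

    docs is a list of documents, each a list of integer word ids.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # words of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                               # words in topic k
    assignments = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic for every word position.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assignments.append(zs)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # remove the current assignment
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, w, z, +1)

    # Smoothed topic proportions for each document.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

# Hypothetical statements, already segmented and with stopwords removed.
statements = [
    ["repayment", "salary", "credit", "repayment"],
    ["business", "capital", "turnover", "sales"],
    ["salary", "credit", "repayment", "house"],
    ["capital", "business", "investment", "turnover"],
]
vocab = sorted({word for s in statements for word in s})
word_id = {word: i for i, word in enumerate(vocab)}
docs = [[word_id[word] for word in s] for s in statements]
topic_vectors = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))
```

Each row of `topic_vectors` sums to 1 and can be appended to the other features of the corresponding loan.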
(b) Choosing the number of topics
Here I use perplexity as the metric for selecting the number of topics. Blei et al., the authors of LDA, used perplexity to evaluate the model. Perplexity measures how well the model generalizes to unseen documents: the lower the perplexity, the better the generalization.
Based on the perplexity plot, I set the number of topics to 40. The features in the topic vector are denoted Topic1, Topic2, …, Topic40.
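For reference, perplexity is the exponential of the negative average log-likelihood per word. The sketch below assumes that the topic vectors `theta` and the topic-word distributions `beta` are already available for the documents being scored. The function name and the toy inputs are hypothetical, and a full evaluation would also need to infer `theta` for held-out documents.

```python
import math

def perplexity(docs, theta, beta):
    """exp(-(total log-likelihood) / (number of words)).

    docs: list of documents, each a list of word ids
    theta[d][k]: probability of topic k in document d
    beta[k][w]: probability of word w under topic k
    """
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * beta[k][w] for k in range(len(beta)))
            log_lik += math.log(p)
            n_words += 1
    return math.exp(-log_lik / n_words)

# Sanity check: a model that is uniform over a 4-word vocabulary
# has a perplexity of about 4.
docs = [[0, 1, 2, 3]]
theta = [[0.5, 0.5]]
beta = [[0.25] * 4, [0.25] * 4]
value = perplexity(docs, theta, beta)
```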
5. Random Forest
Random Forest is a robust classifier which is easy to interpret and implement.
(a) Feature Selection
In this part, I select the top 10 features based on two metrics, Mean Decrease Accuracy and Mean Decrease Gini.
Here I provide the feature importance for two models:
- Model without LDA
- Model with LDA
I notice that the topic features play an important role in the model (e.g., Topic 34 and Topic 1).
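The idea behind Mean Decrease Accuracy is to permute one feature at a time and measure how much the accuracy drops. The sketch below is a simplified version of that idea: it uses a single evaluation set rather than the out-of-bag samples of each tree, and the stand-in model, the function names and the data are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(predict, permuted, y))
    return drops

def predict(row):
    """Hypothetical stand-in model: uses feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]
drops = permutation_importance(predict, X, y)
```

Shuffling the unused feature leaves the accuracy unchanged, while shuffling the feature the model relies on lowers it. Ranking features by this drop gives an importance ordering.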
The 10 features selected are as follows:
Best Features Set
Attributes | Definitions |
---|---|
education | 1: high school 2: junior college 3: undergrad … |
property loan | property of the loan |
scale company | (number of employees) 1: <10 2: 10-100 … |
statement length | the length of the loan statement |
target | the amount of the loan |
term | the holding period of the loan |
time job | 1: <1 year 2: 1-3 years … |
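Topic1 | probability of the loan statement being assigned to Topic 1 |
Topic34 | probability of the loan statement being assigned to Topic 34 |
Topic39 | probability of the loan statement being assigned to Topic 39 |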
It is interesting to look at the words that make up the selected topics:
Topic 1 | Topic 34 | Topic 39 |
---|---|---|
还款(Repayment) | 希望(Hope) | 投资(Investment) |
信用(Credit) | 想(Want) | 生意(Business) |
工资(Salary) | 还款(Repayment) | 资金(Capital) |
逾期(Overdue) | 谢谢(Thanks) | 开(Start up) |
房屋(House) | 申请(Apply) | 周转(Turnover) |
能力(Ability) | 资金(Capital) | 有限公司(Co., Ltd.) |
短期(Short-term) | 房子(House) | 一家(One) |
周转(Turnover) | 支持(Support) | 销售(Sales) |
打卡(Attendance) | 钱(Money) | 朋友(Friend) |
来源(Source) | 平台(Platform) | 银行(Bank) |
Note: The loan statements are all written in Chinese.
Broadly, Topic 1 relates to the loan itself and its repayment, Topic 34 to the attitude of the borrower, and Topic 39 to investment and business.
(b) Tuning the parameters
- Select the number of trees in the random forest: ntree
Once ntree exceeds 100, the error has stabilized. Thus, I choose ntree=100.
- Select the number of candidate features at each split: mtry
The OOB error is minimized when mtry=3. Hence, I set mtry to 3.
- Select the maximum number of terminal nodes (leaves) per tree: maxnodes
Based on accuracy, F1 score and AUC, I set maxnodes to 400. Here I also take the complexity of the model into account: when maxnodes is larger than 400, the improvement in performance is not significant.
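The rule used for maxnodes, preferring the simplest model whose score is close to the best one, can be written as a small helper. The helper name, the tolerance and the scores below are made up for illustration and are not results from this project.

```python
def smallest_within_tolerance(scores, tol=0.005):
    """Return the smallest parameter value whose score is within tol of the best.

    scores maps a parameter value to a validation score (higher is better).
    """
    best = max(scores.values())
    return min(value for value, score in scores.items() if best - score <= tol)

# Made-up validation scores for several maxnodes values.
score_by_maxnodes = {100: 0.60, 200: 0.66, 400: 0.70, 800: 0.702, 1600: 0.703}
chosen = smallest_within_tolerance(score_by_maxnodes)  # 400 for these made-up scores
```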
6. Evaluation
(a) Test Set Performance
The ROC curve on the test set is given as:
The comparison of performance on validation set and test set:
Metrics | Accuracy | Precision | Recall | F1 Score | AUC |
---|---|---|---|---|---|
Validation Set | 0.9172 | 0.5853 | 0.9955 | 0.7372 | 0.9595 |
Test Set | 0.9297 | 0.6254 | 0.9911 | 0.7668 | 0.9564 |
Since the performance on the test set is comparable to that on the validation set, we can conclude that the model generalizes well to out-of-sample data.
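For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall and F1 score follow from the confusion counts, with default (1) as the positive class. The function names and the toy labels are hypothetical. The last lines check that the F1 score in the validation row is consistent with its precision and recall.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels: both defaults are caught, with one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)

# Validation row of the table: precision 0.5853 and recall 0.9955.
validation_f1 = f1_score(0.5853, 0.9955)  # about 0.7372
```

A recall close to 1 together with a precision around 0.6 means that nearly all defaults are flagged, at the cost of some false alarms.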
(b) Test Set Comparison Between Two Models
To further demonstrate the contribution of LDA, we can compare the performance of the two models on the test set:
Metrics | Accuracy | Precision | Recall | F1 Score | AUC |
---|---|---|---|---|---|
Without LDA | 0.9157 | 0.5850 | 0.9550 | 0.7250 | 0.9330 |
With LDA | 0.9297 | 0.6254 | 0.9911 | 0.7668 | 0.9564 |
Relative Increase | 1.54% | 6.9% | 3.77% | 5.77% | 2.51% |
Every metric improves when the topic vectors are added to the features, with relative gains between 1.54% and 6.9%. Thus, the LDA features make a clear contribution to the feature set.
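The last row of the table is a relative increase, (with - without) / without. The sketch below recomputes it from the rounded table entries. The variable names are hypothetical, and the recomputed values can differ from the reported row by about 0.01 percentage points, which is consistent with rounding in the table.

```python
without_lda = {"Accuracy": 0.9157, "Precision": 0.5850, "Recall": 0.9550,
               "F1 Score": 0.7250, "AUC": 0.9330}
with_lda = {"Accuracy": 0.9297, "Precision": 0.6254, "Recall": 0.9911,
            "F1 Score": 0.7668, "AUC": 0.9564}

# Relative increase in percent for each metric.
relative_increase = {
    metric: (with_lda[metric] - without_lda[metric]) / without_lda[metric] * 100
    for metric in without_lda
}

for metric, value in relative_increase.items():
    print(f"{metric}: {value:.2f}%")
```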
7. Conclusion
From the results above, we can conclude that:
- The model successfully captures the information in the dataset. Even without LDA, the random forest achieves satisfactory accuracy on the test set.
- LDA proves useful for natural language processing here. It extracts information from the loan statements, and the performance of the model increases as a result.
- The features related to the borrower's job and to the loan statement play an important role in predicting the probability of default in P2P lending.