Repeat Buyers

Posted by VitoDH Blog on January 16, 2019

Predicting Potential Repeat Buyers

Author: Dehai Liu

Department of Mathematics, Sun Yat-Sen University

1. Abstract

Our project focuses on digging out potential repeat buyers online with data-driven methods and state-of-the-art algorithms. This can not only provide sellers with guidance on their operations, but also help the online shopping platform pinpoint target customers for advertisements and coupons.

2. Data Description

T-mall is one of the largest online shopping platforms in China. On November 11th, Singles' Day, the trading volume on T-mall reached 120 billion CNY, indicating the tremendous profit behind this shopping festival. In this project, our data consists of buyers' shopping records for the six months leading up to and including Singles' Day. The data is divided into three parts as follows:

User Log

Attribute     Definition
user_id       Unique ID of the user
item_id       Unique ID of the item
cat_id        Unique ID of the item's category
merchant_id   Unique ID of the merchant
brand_id      Unique ID of the brand
time_stamp    Date on which the user performed the action
action_type   0: click, 1: add to cart, 2: order, 3: add to favorite

User Info

Attribute   Definition
user_id     Unique ID of the user
age_range   1: <18, 2: [18,24], 3: [25,29], 4: [30,34], 5: [35,39], 6: [40,49], 7: >=50, 0: unknown
gender      0: female, 1: male, 2: NULL (unknown)

Training Set and Test Set

Attribute     Definition
user_id       Unique ID of the user
merchant_id   Unique ID of the merchant
label         1: repeat buyer, 0: non-repeat buyer

3. Data Preprocessing

(a) Load the data

  • Load user_log.csv and user_info.csv
  • Remove the data points containing NA in user_info
  • Load train_format1.csv and pick a subset of 100,000 samples, denoted trainSet (see the sketch below)
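
A minimal pandas sketch of this step; the random seed and the use of DataFrame.sample are assumptions not stated in the write-up:

```python
import pandas as pd

# Load the raw tables (file names as given above).
user_log = pd.read_csv("user_log.csv")
user_info = pd.read_csv("user_info.csv")

# Remove the rows of user_info that contain NA values.
user_info = user_info.dropna()

# Load the labeled (user, merchant) pairs and keep a 100,000-sample subset.
train = pd.read_csv("train_format1.csv")
trainSet = train.sample(n=100_000, random_state=42)
```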

(b) Outlier Detection

Because of click farming, we need to find the users and merchants with unusual clicking behavior and remove them from trainSet. Below are scatterplots of the four actions for users and merchants, respectively.

[Figure: scatterplots of the four action counts, one panel for users and one for merchants (sellers)]

Based on the distributions above, we define a record as an outlier if it exceeds the corresponding threshold below:

Action            User     Merchant
Click             > 4000   > 200000
Add to cart       none     > 250
Buy               > 100    > 10000
Add to favorite   > 450    > 10000

After removing the outlying users and merchants, we still have 90917 samples in the training set.
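
A sketch of how these thresholds could be applied, assuming the column names from the data description and the user_log / trainSet frames from the loading step:

```python
# Count each action type per user and per merchant from the user log.
action_names = {0: "click", 1: "cart", 2: "buy", 3: "fav"}
user_counts = (user_log.groupby(["user_id", "action_type"]).size()
                       .unstack(fill_value=0).rename(columns=action_names))
merchant_counts = (user_log.groupby(["merchant_id", "action_type"]).size()
                           .unstack(fill_value=0).rename(columns=action_names))

# Thresholds from the table above ("none" means no limit for that cell).
bad_users = user_counts[(user_counts["click"] > 4000) |
                        (user_counts["buy"] > 100) |
                        (user_counts["fav"] > 450)].index
bad_merchants = merchant_counts[(merchant_counts["click"] > 200000) |
                                (merchant_counts["cart"] > 250) |
                                (merchant_counts["buy"] > 10000) |
                                (merchant_counts["fav"] > 10000)].index

# Drop training pairs that involve an outlying user or merchant.
trainSet = trainSet[~trainSet["user_id"].isin(bad_users) &
                    ~trainSet["merchant_id"].isin(bad_merchants)]
```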

4. Feature Engineering

Given a specific buyer and seller, it is straightforward to count how many times the buyer performed each of the four actions in that merchant's store. Here we define a 6-dimensional vector to capture the direct link between the user and the merchant:

(a) First, we define a weight for each action type:

Action            Value   Count   Weight
Click             0       n_1     0.1
Add to cart       1       n_2     0.2
Buy               2       n_3     0.3
Add to favorite   3       n_4     0.4

(b) Vitality and Popularity

Vitality measures how much a user loves shopping; popularity measures how attractive a merchant's commodities are. Both are computed as weighted averages of the four actions, and each can be specialized further: category vitality and brand vitality for the user, and category popularity and brand popularity for the merchant.

Now we illustrate the calculation, taking category vitality as an example.

The score of item $i$ for a given user $u$ is

$$s_{u,i} = 0.1\,n_1 + 0.2\,n_2 + 0.3\,n_3 + 0.4\,n_4,$$

where $n_1,\dots,n_4$ are the user's action counts on that item. The category vitality of user $u$ is then

$$V_u = \frac{1}{|I_u|}\sum_{i \in I_u} s_{u,i},$$

where $I_u$ refers to the set of items that are relevant to user $u$.

Similarly, we can calculate the other three indicators and combine them into a 4-dimensional vector.
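
A possible pandas implementation of the category vitality, under the assumption that it is the average weighted item score over a user's relevant items (variable names are illustrative):

```python
# Weighted score per (user, item): s_{u,i} = 0.1*n1 + 0.2*n2 + 0.3*n3 + 0.4*n4.
weights = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
scored = user_log.assign(w=user_log["action_type"].map(weights))

# Sum the weights per user-item pair to get s_{u,i} ...
item_scores = scored.groupby(["user_id", "item_id"])["w"].sum()

# ... and average over each user's items to get the category vitality V_u.
cat_vitality = item_scores.groupby(level="user_id").mean().rename("cat_vitality")
```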

(c) Normalization

After setting up the features, we find that they have very different scales. Thus, it is reasonable to scale each attribute to [0, 1] using min-max normalization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
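
In scikit-learn this scaling is a single transformer; here, features is a hypothetical DataFrame holding the 10 engineered feature columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale each feature column to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)  # 'features': hypothetical 10-column frame
```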

(d) PCA

Based on parts (a) and (b), we have obtained 10 features. To simplify the training process and remove redundant information, we perform PCA on the training set. The scree plot and the variance of the components are given as follows:

[Figure: scree plot and cumulative variance explained by the principal components]

Noting that the cumulative proportion has reached 0.94 at the 4th principal component, we can simply pick the first four principal components as our training attributes.
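
A sketch with scikit-learn's PCA, reusing the hypothetical scaled feature matrix from above:

```python
from sklearn.decomposition import PCA

# Fit PCA on the 10 scaled features and inspect the cumulative explained variance.
pca = PCA(n_components=10)
components = pca.fit_transform(features_scaled)
print(pca.explained_variance_ratio_.cumsum())  # ~0.94 by the 4th component

# Keep the first four principal components as the training attributes.
X_train = components[:, :4]
```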

5. Class Balance

Taking a glance at the distribution of the labels:

Type                  Number
Total samples         90917
Positive labels (1)   5363
Negative labels (0)   85554
Distinct users        75053
Distinct merchants    1982

From the table above, positive samples cover only 5.9% of the total, which can easily push a classifier to label every sample as negative.

We use four sampling schemes to address this problem and obtain a balanced dataset.

Sampling         Label 0   Label 1
Raw              84700     5300
Over sampling    84700     84700
Under sampling   5300      5300
Over and under   44959     45041
SMOTE            44959     45041
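
The first three schemes map directly onto imbalanced-learn samplers, and "over and under" can be approximated by chaining an over-sampler with an under-sampler. A sketch, assuming X_train holds one feature row per trainSet pair (the random seeds are assumptions):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

y = trainSet["label"].values

# Over sampling: replicate minority samples up to the majority count.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y)

# Under sampling: drop majority samples down to the minority count.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y)

# SMOTE: synthesize new minority points by interpolating nearest neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y)
```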

6. Training with XGBoost

XGBoost is a cutting-edge algorithm derived from GBDT that can handle missing data and guard against overfitting.

(a) Parametrization

Parameter           Value
max_depth           5
learning_rate       0.1
max_iter            800
learning_function   logistic
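
A sketch of this configuration with the xgboost Python package, where max_iter is read as the number of boosting rounds (n_estimators) and the logistic learning function as the binary:logistic objective:

```python
from xgboost import XGBClassifier

# Parameters mapped from the table above.
model = XGBClassifier(
    max_depth=5,
    learning_rate=0.1,
    n_estimators=800,
    objective="binary:logistic",
)
model.fit(X_over, y_over)  # e.g., train on the over-sampled data
```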

(b) Performance under different sampling schemes

(1) Over Sampling

  Train Test
Precision 0.783 0.128
Recall 0.846 0.444
F1 Score 0.814 0.200
F2 Score 0.832 0.297
AUC 0.885 0.603

(2) Under Sampling

  Train Test
Precision 0.833 0.080
Recall 0.862 0.667
F1 Score 0.848 0.144
F2 Score 0.856 0.270
AUC 0.929 0.549

(3) Over and Under Sampling

  Train Test
Precision 0.817 0.119
Recall 0.846 0.460
F1 Score 0.832 0.190
F2 Score 0.839 0.293
AUC 0.909 0.609

(4) SMOTE

  Train Test
Precision 0.727 0.143
Recall 0.687 0.460
F1 Score 0.706 0.218
F2 Score 0.695 0.318
AUC 0.789 0.641
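
All five metrics above are available in scikit-learn; a sketch assuming a held-out test split (X_test, y_test) and the fitted model from above:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             fbeta_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("F2 score: ", fbeta_score(y_test, y_pred, beta=2))  # weights recall over precision
print("AUC:      ", roc_auc_score(y_test, y_prob))
```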

7. Conclusion

From the results above, we can conclude that:

  • The model successfully captures the information in the dataset, as reflected by the high F1 score and AUC on the training set.
  • The model can detect whether a buyer will return to a specific online store, provided the interaction data between them is given.
  • To improve performance on the test set, we need to focus more on feature engineering and sampling.