Nicolas
April 21, 2016
To validate Module 3 and correctly answer the questions will require that you perform the following exercise first: take the code in the file named module3.R, and modify the probability model such as the predictors taken into account exclusively include (1) recency, (2) the log of recency, (3) frequency, (4) the log of frequency. Also, list the detailed predictions made for the first 10 customers.
data <- read.delim(file = 'purchases.txt', header = FALSE, sep = '\t', dec = '.')
colnames(data) <- c('customer_id', 'purchase_amount', 'date_of_purchase')
data$date_of_purchase <- as.Date(data$date_of_purchase, "%Y-%m-%d")
data$year_of_purchase <- as.numeric(format(data$date_of_purchase, "%Y"))
data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
time2 = data$date_of_purchase,
units = "days"))
Let's compute key marketing indicators: RFM variables as of a year ago, in 2014
library(dplyr)
customers_2014 <- data %>% filter(days_since > 365) %>%
group_by(customer_id) %>%
summarize( # creates new variables:
recency = min(days_since) - 365,
first_purchase = max(days_since) - 365,
frequency = n(),
avg_amount = mean(purchase_amount),
max_amount = max(purchase_amount) )
Here we are extracting all the predictors that we will use in the predictive model. Predictors are variables computed a year ago.
revenue_2015 <- data %>% filter(year_of_purchase == 2015) %>%
group_by(customer_id) %>%
summarize( # creates a new variable:
revenue_2015 = sum(purchase_amount) )
Here we are extracting the target variable. The next step is to merge both predictors and target variable to create the calibration dataset (ie the training dataset).
# in_sample should be named training_df
# out_sample should be named test_df
in_sample <- merge(customers_2014, revenue_2015, all.x = TRUE)
in_sample$revenue_2015[is.na(in_sample$revenue_2015)] <- 0
# we add a new variable 'active_2015': 1 if there was a revenue in 2015, 0 otherwise
in_sample$active_2015 <- as.numeric(in_sample$revenue_2015 > 0)
in_sample dataset includes now the following data:
- from customer_2014:
- first_purchase
- frequency
- avg_amount
- max_amount
- from revenue_2015:
- revenue_2015
- active_2015 (1 or 0)
head(in_sample)
## customer_id recency first_purchase frequency avg_amount max_amount
## 1 10 3463.9583 3463.958 1 30.0 30
## 2 80 301.9583 3385.958 6 70.0 80
## 3 90 392.9583 3417.958 10 115.8 153
## 4 120 1035.9583 1035.958 1 20.0 20
## 5 130 2604.9583 3344.958 2 50.0 60
## 6 160 2597.9583 3211.958 2 30.0 30
## revenue_2015 active_2015
## 1 0 0
## 2 80 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
summary(in_sample)
## customer_id recency first_purchase frequency
## Min. : 10 Min. : 0.958 Min. : 0.958 Min. : 1.000
## 1st Qu.: 77710 1st Qu.: 257.958 1st Qu.: 795.958 1st Qu.: 1.000
## Median :127140 Median : 889.958 Median :1890.958 Median : 2.000
## Mean :127315 Mean :1122.545 Mean :1788.369 Mean : 2.665
## 3rd Qu.:181800 3rd Qu.:1867.958 3rd Qu.:2694.958 3rd Qu.: 3.000
## Max. :245840 Max. :3648.958 Max. :3650.958 Max. :40.000
## avg_amount max_amount revenue_2015 active_2015
## Min. : 5.00 Min. : 5.00 Min. : 0.00 Min. :0.0000
## 1st Qu.: 21.25 1st Qu.: 25.00 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 30.00 Median : 30.00 Median : 0.00 Median :0.0000
## Mean : 55.53 Mean : 65.72 Mean : 21.22 Mean :0.2299
## 3rd Qu.: 50.00 3rd Qu.: 60.00 3rd Qu.: 0.00 3rd Qu.:0.0000
## Max. :4500.00 Max. :4500.00 Max. :4500.00 Max. :1.0000
Here, we create a model to predict whether a customer is active in 2015 or not, based on the customer data of 2014.
The outcome of the prediction is either 1 or 0 (ie. two classes). This is a classification problem that can be solved using a logistic regression. The following code fits the above training set to a logistic regression model.
As required in this problem set, the predictors are now: (1) recency, (2) the log of recency, (3) frequency, (4) the log of frequency.
prob.model <- glm(active_2015 ~ recency + log(recency) + frequency + log(frequency),
family=binomial(link='logit'),
data=in_sample)
summary(prob.model)
##
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency +
## log(frequency), family = binomial(link = "logit"), data = in_sample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1792 -0.5792 -0.2533 -0.0920 3.5471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.147e-01 8.437e-02 3.73 0.000191 ***
## recency -1.347e-03 6.378e-05 -21.12 < 2e-16 ***
## log(recency) -2.223e-01 1.810e-02 -12.29 < 2e-16 ***
## frequency -1.394e-02 1.885e-02 -0.74 0.459381
## log(frequency) 8.788e-01 7.205e-02 12.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18228 on 16904 degrees of freedom
## Residual deviance: 12051 on 16900 degrees of freedom
## AIC: 12061
##
## Number of Fisher Scoring iterations: 6
Then shows the coefficients of the model
coef <- summary(prob.model)$coefficients[,1]
# see to find out how to access this field:
# attributes( summary( prob.model )$coefficients )
coef
## (Intercept) recency log(recency) frequency log(frequency)
## 0.314704392 -0.001346672 -0.222337680 -0.013943181 0.878833064
The coefficient (weight) of the recency is negative, which makes sense: the larger the recency, the smaller the probability the customer will make another purchase in the future.
Then shows its standard deviation:
#std <- summary(prob.model)$standard.errors
std <- summary(prob.model)$coefficients[,2]
std
## (Intercept) recency log(recency) frequency log(frequency)
## 8.436725e-02 6.377517e-05 1.809762e-02 1.884553e-02 7.204729e-02
and the ratio of the coefficient divided by its standard deviation:
coef / std
## (Intercept) recency log(recency) frequency log(frequency)
## 3.7301725 -21.1159242 -12.2854634 -0.7398668 12.1980031
As a rule, when coef/std is above 2 or below -2, the parameter value is significant (within a 95% confidence interval, ie Z = +/- 1.96). Here, the impact of frequency on the predictions is limited.
Here, we create a model to predict the revenue generated by a customer in 2015, based on the average amount and maximum amount he spent in 2014.
First, let's select from the training set only customers who made a purchase in 2015:
z <- which(in_sample$active_2015 == 1) # store the index of active customer in 2015
head(in_sample[z, ])
## customer_id recency first_purchase frequency avg_amount max_amount
## 2 80 301.95833 3385.958 6 70.00000 80
## 18 480 15.95833 3312.958 11 62.27273 235
## 30 830 266.95833 3373.958 6 48.33333 60
## 31 850 61.95833 3050.958 8 28.12500 30
## 32 860 266.95833 3642.958 9 53.33333 60
## 39 1020 1461.95833 1826.958 3 73.33333 100
## revenue_2015 active_2015
## 2 80 1
## 18 45 1
## 30 50 1
## 31 60 1
## 32 60 1
## 39 120 1
summary(in_sample[z, ])
## customer_id recency first_purchase frequency
## Min. : 80 Min. : 0.958 Min. : 0.958 Min. : 1.000
## 1st Qu.: 78590 1st Qu.: 22.958 1st Qu.: 649.708 1st Qu.: 2.000
## Median :143550 Median : 96.958 Median :1603.958 Median : 4.000
## Mean :134907 Mean : 306.305 Mean :1636.122 Mean : 4.741
## 3rd Qu.:194363 3rd Qu.: 327.958 3rd Qu.:2665.958 3rd Qu.: 7.000
## Max. :236660 Max. :3543.958 Max. :3646.958 Max. :40.000
## avg_amount max_amount revenue_2015 active_2015
## Min. : 5.00 Min. : 5.00 Min. : 5.0 Min. :1
## 1st Qu.: 30.00 1st Qu.: 30.00 1st Qu.: 30.0 1st Qu.:1
## Median : 40.00 Median : 50.00 Median : 50.0 Median :1
## Mean : 67.78 Mean : 88.33 Mean : 92.3 Mean :1
## 3rd Qu.: 60.00 3rd Qu.: 80.00 3rd Qu.: 100.0 3rd Qu.:1
## Max. :4500.00 Max. :4500.00 Max. :4500.0 Max. :1
Then let's fit the monetary model with the training set: revenue generated by a customer in 2015, as a function of its average amount and its maximum amount. Since the revenue can take a range of value (not only either 1 or 0), this is a regression problem that can be solved here using a linear model (lm):
amount.model <- lm(
formula = revenue_2015 ~ avg_amount + max_amount,
data = in_sample[z, ]) # fits the model only to active customers in 2015 in the training set
summary(amount.model)
##
## Call:
## lm(formula = revenue_2015 ~ avg_amount + max_amount, data = in_sample[z,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2138.5 -20.1 -16.6 0.1 3361.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.74709 2.38112 8.713 <2e-16 ***
## avg_amount 0.67486 0.03280 20.575 <2e-16 ***
## max_amount 0.29225 0.02363 12.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136.6 on 3883 degrees of freedom
## Multiple R-squared: 0.6054, Adjusted R-squared: 0.6052
## F-statistic: 2979 on 2 and 3883 DF, p-value: < 2.2e-16
All statistics (standard error, t value) are highly significant. The coefficient of determination R^2 value is 0.60.
In order to validate our our model (on the training set), let's plot the prediction of revenue generated by 2015 customers, computed with the linear model, against the revenue generated by 2015 customers (the ground truth):
plot(x = in_sample[z, ]$revenue_2015,
y = amount.model$fitted.values)
# Note: ggplot requires data from the same dataframe. Simpler to use plot here.
If our model is accurate, we should see linear relation between the prediction and the ground truth, which is not the case here: most customers have spent small amounts,few outliers have spent huge amounts.
Let's try again with a log-transform:
amount.model <- lm(
formula = log(revenue_2015) ~ log(avg_amount) + log(max_amount),
data = in_sample[z, ])
summary(amount.model)
##
## Call:
## lm(formula = log(revenue_2015) ~ log(avg_amount) + log(max_amount),
## data = in_sample[z, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7866 -0.1811 -0.0770 0.1852 3.5656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37000 0.04003 9.242 <2e-16 ***
## log(avg_amount) 0.54881 0.04167 13.171 <2e-16 ***
## log(max_amount) 0.38813 0.03796 10.224 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4781 on 3883 degrees of freedom
## Multiple R-squared: 0.6927, Adjusted R-squared: 0.6926
## F-statistic: 4377 on 2 and 3883 DF, p-value: < 2.2e-16
We can see that the R^2 has improved to 0.69 meaning that we fit the data much better.
plot(x = log(in_sample[z, ]$revenue_2015),
y = amount.model$fitted.values)
This is much better. We put less weight to the very large values and more weight to the smaller values.
Now that we fitted our two models (prob.model and amount.model) to 2015's data (as a function of 2014's data), we want to apply the two models to 2015's data in order to predict future behavior of customers in 2016.
customers_2015 <- data %>% group_by(customer_id) %>%
summarize( # creates new variables:
recency = min(days_since),
first_purchase = max(days_since),
frequency = n(),
avg_amount = mean(purchase_amount),
max_amount = max(purchase_amount) )
Now we are predicting the probability that a customer will be active in 2016 (ie the probability that the outcome is 1):
customers_2015$prob_predicted <- predict(object = prob.model,
newdata = customers_2015,
type = "response") # argument to get the probability that the outcome is 1, using the glm function
and predicting the revenue generated by a customer in 2016 (taking the exp of the prediction, since we fitted the model using a log transform):
customers_2015$revenue_predicted <- exp(predict(object = amount.model,
newdata = customers_2015))
Then we compute the score of customers, defined as the probability that a customer will be active in 2016 times the predicted revenue generated in 2016:
customers_2015$score_predicted <- customers_2015$prob_predicted * customers_2015$revenue_predicted
If there is a 10% chance that a customer generates $100, the score will be $10.
Then displays results:
summary(customers_2015$prob_predicted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0009581 0.0194200 0.0963900 0.2252000 0.3638000 0.9532000
Customers has a 22.5% chance of a customer being active in 2016, on average.
summary(customers_2015$revenue_predicted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.54 29.00 35.05 65.63 57.30 3833.00
Customers are predicted to generate a revenue of $65 in 2016, on average. The minimum revenue is $6.5 (and not $0 because it assumes the customer will spend something).
summary(customers_2015$score_predicted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0113 0.6806 4.0730 18.2500 17.6000 2728.0000
The score has a mean of 18.2. This value is extremely important from a managerial standpoint: on average, every customer will spend $18.2 next year, in 2016. This is the expected revenue in 2016.
library(ggplot2)
ggplot(data=customers_2015, aes(score_predicted)) +
geom_histogram(binwidth = 5, color = I('black'), fill= I('orange')) +
scale_x_continuous(limits = c(0, 200),
breaks = seq(0, 200, 20))
## Warning: Removed 188 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
1/ In the modified model, which of the following predictors does not seem to bear a significant influence on probability predictions?
summary(prob.model)
##
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency +
## log(frequency), family = binomial(link = "logit"), data = in_sample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1792 -0.5792 -0.2533 -0.0920 3.5471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.147e-01 8.437e-02 3.73 0.000191 ***
## recency -1.347e-03 6.378e-05 -21.12 < 2e-16 ***
## log(recency) -2.223e-01 1.810e-02 -12.29 < 2e-16 ***
## frequency -1.394e-02 1.885e-02 -0.74 0.459381
## log(frequency) 8.788e-01 7.205e-02 12.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18228 on 16904 degrees of freedom
## Residual deviance: 12051 on 16900 degrees of freedom
## AIC: 12061
##
## Number of Fisher Scoring iterations: 6
Frequency has a small Z_value (between -2 and 2) and hence doesn't seem to bear a significant influence on probability predictions. Recency seems to have the most impact on the predictions, with a Z_value = -21.
2/ Suppose a customer has a 20% probability of being active, and a predicted revenue of $200 if he is. Compute the score of this costumer.
Score = probability x predicted revenue = 20% x $200 = $40.
The score of that customer is $40 (ie on average, this customer is expected to generate $40 in revenue next year). This customer has a 80% chance of not buying anything next year.
3/ Looking at the predictions made for the first 10 customers in the database, which of the following statements is correct?
head(customers_2015, 10)
## Source: local data frame [10 x 9]
##
## customer_id recency first_purchase frequency avg_amount max_amount
## (int) (dbl) (dbl) (int) (dbl) (dbl)
## 1 10 3828.9583 3828.958 1 30.00000 30
## 2 80 342.9583 3750.958 7 71.42857 80
## 3 90 757.9583 3782.958 10 115.80000 153
## 4 120 1400.9583 1400.958 1 20.00000 20
## 5 130 2969.9583 3709.958 2 50.00000 60
## 6 160 2962.9583 3576.958 2 30.00000 30
## 7 190 2210.9583 3689.958 5 68.00000 100
## 8 220 2057.9583 3663.958 2 25.00000 30
## 9 230 3984.9583 3984.958 1 50.00000 50
## 10 240 462.9583 3814.958 4 16.25000 20
## Variables not shown: prob_predicted (dbl), revenue_predicted (dbl),
## score_predicted (dbl)
Customer #80 is more likely than not to make a purchase next year. Customer #190 has made 5 purchases so far. Customer #230 has a score of 0.0564. If active, customer #130 is expected to spend $60.7 on average.
4/ Among the following customers, which one is expected to be the most profitable for the firm?
The one who has the highest score is the most profitable for the firm. For the first 10 customers in the database, it is customer #90. See table above.
5/ In the probability model, why is the sign of the recency parameter expected to be negative?
summary(prob.model)
##
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency +
## log(frequency), family = binomial(link = "logit"), data = in_sample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1792 -0.5792 -0.2533 -0.0920 3.5471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.147e-01 8.437e-02 3.73 0.000191 ***
## recency -1.347e-03 6.378e-05 -21.12 < 2e-16 ***
## log(recency) -2.223e-01 1.810e-02 -12.29 < 2e-16 ***
## frequency -1.394e-02 1.885e-02 -0.74 0.459381
## log(frequency) 8.788e-01 7.205e-02 12.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18228 on 16904 degrees of freedom
## Residual deviance: 12051 on 16900 degrees of freedom
## AIC: 12061
##
## Number of Fisher Scoring iterations: 6
The negative sign of the recency parameter ('estimate') means that the larger the recency, the smaller the probability that a customer will be active the following year.
6/ How many customers have an expected revenue of more than $50
z = which(customers_2015$score_predicted > 50)
length(z)
## [1] 1394
1394 customers have an expected revenue of more than $50 in 2016. We have the list of customers on which we should spend the most marketing dollar (ie the customers with the highest score).