---
title: "Workforce Management Forecasting with R"
author: "Tesfahun Tegene Boshe"
date: "03/03/2022"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    code_folding: show
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Forecasting is the first major stage of workforce management planning. WFM forecasters predict, among other things, the volume of conversations to expect in some future period. Forecasting future contacts is difficult because of the uncertainty in the many factors that drive contact volume, as well as the way contacts are distributed over time.
Time series data consist of observations of a variable at consecutive points in time, in chronological order. There are three broad approaches to time series forecasting.
**Time series models** - Time series models define the current value of a variable as a function of its own history; no other variable is involved.
**Econometric models** - Econometric models use additional explanatory variables to model the variation in the target variable.
**Hybrid models** - Hybrid models combine the strengths of both worlds.
While there is not a single best method for forecasting, there are widely used methods and best practices. We will discuss some of the most commonly used time series models and hybrid approaches in this blog. We will also cover the metrics which can indicate the accuracy of the forecast.
The *accuracy()* function from the *forecast* package reports the mean error (ME), root mean squared error (RMSE), mean absolute error (MAE), mean percentage error (MPE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), the first-order autocorrelation of the residuals (ACF1) and Theil's U.
MAPE is a popular accuracy measure because it is insensitive to the sign of the error and independent of the scale of the units. We will use MAPE to compare the accuracy of the methods discussed below.
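For reference, MAPE is simply the mean of the absolute percentage errors; a minimal sketch of the calculation, with hypothetical actuals and forecasts:
```{r}
actual <- c(100, 120, 90) # hypothetical actual volumes
fcst   <- c(110, 115, 95) # hypothetical forecasts
mean(abs((actual - fcst) / actual)) * 100 # MAPE in percent
```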
```{r, echo=F}
setwd("C:/Users/Tesfahun Boshe/Documents/WFM/WFM-forecasting-with-R")
```
## EDA
Loading the data
```{r}
HistoricalData <- read.csv("historicalData.csv")
```
### Data Description
The data we will use is historical monthly volume data from Jan 2019 to Dec 2022 for a hypothetical contact center, covering two channels: email and phone.
* Period - first day of the month representing the particular month
* Email - the number of email contacts
* Phone - the number of phone call contacts
```{r}
HistoricalData$Period <- as.Date(HistoricalData$Period,format = "%d/%m/%Y")
HistoricalData$Email <- as.numeric(HistoricalData$Email)
HistoricalData$Phone <- as.numeric(HistoricalData$Phone)
str(HistoricalData)
```
Our goal is to forecast the email contacts for the next 12 months. The phone contacts will be used as an additional variable when we discuss hybrid models.
Let's define the Email column as a time series object. The *ts()* function from the *stats* package does just that.
```{r}
data_ts <- ts(HistoricalData[,2], frequency = 12) # 12 since monthly data
```
### Time Series Plots
Time series plots help us identify outliers in the data; outliers need to be handled before modeling.
To plot the time series data, we will use the *fpp2* package.
```{r}
# install.packages("fpp2")
library(fpp2)
autoplot(data_ts) # time series plot
```
```{r}
ggseasonplot(data_ts,main = "Season Plot") # time series plot for each season
```
```{r}
ggseasonplot(data_ts, polar = T,main = "Polar Season Plot") # polar season plot
```
### Data partitioning
It is important to test your models on historical data. We will, therefore, divide our historical data into train and test samples. The true test data is of course unknown and lies in the future; however, testing on out-of-sample data gives you confidence that the model will be robust against the future reality as well.
```{r}
train <- subset(data_ts, end = length(data_ts) - 12) # 2019 through 2021
test <- subset(data_ts, start = length(data_ts) - 12 + 1) # 2022 data
```
# Methods
## 1. Naive Method / Random Walk without Drift
The *naive()* function from the *forecast* package can be used for naive forecasts. A naive forecast simply extends the last observed value across the forecast horizon.
```{r}
fc_naive <- naive(train, 12) # forecast 12 periods ahead
autoplot(fc_naive) # train, forecast together.
```
We can check the error band and the residuals.
```{r}
summary(fc_naive)
checkresiduals(fc_naive) # Do they look like white noise?
```
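For comparison with the methods below, we can also compute the test-set MAPE of the naive forecast:
```{r}
# accuracy of the naive forecast against the held-out test period
accuracy(fc_naive, data_ts)['Test set','MAPE']
```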
## 2. Seasonal Naive Method / Seasonal Random Walk
Instead of extending the very last observation of the train data, the seasonal naive method repeats, for each month, the corresponding value from the last observed season.
```{r}
fc_snaive <- snaive(train, 12)
autoplot(fc_snaive) # captures some seasonality in the data
```
```{r}
# error band and the residuals
checkresiduals(fc_snaive) # does the residual look like white noise?
summary(fc_snaive)
```
```{r}
#accuracy
accuracy(fc_snaive, data_ts)['Test set','MAPE']
```
## 3. Holt-Winters exponential smoothing (triple exponential smoothing)
The *ses()*, *holt()* and *hw()* functions from the *forecast* package can be used for the Holt-Winters family of exponential smoothing algorithms. We need to check whether or not our data has trend and seasonality to choose among them, or we can be lazy and use the *ets()* function from the same package, which works well in all cases.
a. Simple exponential smoothing - no trend or seasonality
```{r}
fc_ses <- ses(train, h = 12) # only for a stationary ts
```
```{r}
autoplot(fc_ses) # forecast plot
```
b. Holt's method - exponential smoothing with trend
```{r}
fc_ses_t <- holt(train,h = 12)
autoplot(fc_ses_t)
```
c. Holt-Winters method - exponential smoothing with trend and seasonality
```{r}
# Simple exponential smoothing with trend and seasonality
fc_ses_ts_m <- hw(train, seasonal = "multiplicative", h = 12)
fc_ses_ts_a <- hw(train, seasonal = "additive", h = 12)
autoplot(fc_ses_ts_m)
autoplot(fc_ses_ts_a)
```
Accuracy
```{r}
accuracy(fc_ses_ts_m, data_ts)['Test set','MAPE']
```
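The additive variant can be scored the same way, to check which seasonal specification fits this series better:
```{r}
accuracy(fc_ses_ts_a, data_ts)['Test set','MAPE']
```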
## 4. ETS - Error, Trend, Seasonality
*ets()* performs automatic forecasting with exponential smoothing, selecting the best combination of error, trend and seasonal components for a wide range of time series.
```{r}
fitets <- ets(train)
fc_ETS <- forecast(fitets,h = 12)
```
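We can inspect which ETS specification was selected:
```{r}
summary(fitets) # prints the chosen model, e.g. ETS(M,N,M), and its smoothing parameters
```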
Accuracy
```{r}
accuracy(fc_ETS, data_ts)['Test set','MAPE']
```
## 5. Box-Jenkins - ARIMA/SARIMA/SARIMAX
Box-Jenkins methods are popular choices for short-term forecasts. They should not be used beyond a few periods, since at longer horizons the forecast values flatten out to a constant.
Box-Jenkins models require a stationary time series. A stationary series has no trend or seasonality, and its mean and variance are constant over time. Therefore, it is important to remove those components from the time series, in other words to *stationarize the data.*
A signal can be decomposed into a DC and an AC component. (The DC component is also called the level; the AC component can further be decomposed into trend, seasonality and noise.)
*Trend* - the general long-term change in direction (up/down)
*Seasonality* - regular, repeating increases and decreases tied to the time of day, day of week, month of year, etc.
**Stationarity Tests**
The Dickey-Fuller test and the augmented Dickey-Fuller test are the most commonly used tests for stationarity. The KPSS test, the Phillips-Perron (PP) test and the Ljung-Box white-noise test can also be used.
```{r}
library(tseries)
adf.test(train) # augmented Dickey-Fuller test
```
> Interpretation: the p-value is not less than 0.05, so we fail to reject the null hypothesis of a unit root. This means the time series is non-stationary.
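The KPSS test mentioned above is also available in the *tseries* package. Note that its null hypothesis is the reverse of the ADF test's: the null is stationarity, so a small p-value suggests non-stationarity.
```{r}
kpss.test(train) # null hypothesis: the series is level-stationary
```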
**Data stationarization** - stabilizing the mean and variance of the series
a. Box-Cox transformations
Try various lambda values until the plot looks stationary.
```{r}
# Box-Cox transformations
train %>% BoxCox(lambda = .3) %>% autoplot()
train %>% BoxCox(lambda = -0.7) %>% autoplot()
BoxCox.lambda(train) # the best lambda
```
b. Differencing
```{r}
diff(train)
autoplot(diff(train)) # Plot of differenced data
```
The autocorrelation function plots can tell if there is remaining trend/seasonality.
```{r}
ggAcf(diff(train))
```
c. Seasonal differencing
```{r}
difftrain <- diff(train, lag = 12)
```
d. Logarithmic damping - used to reduce the variance (equivalent to a Box-Cox transformation with lambda = 0).
```{r}
log_train <- log(train)
```
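Rather than guessing how many regular or seasonal differences are needed, the *forecast* package can estimate them directly:
```{r}
ndiffs(train)  # suggested number of first differences
nsdiffs(train) # suggested number of seasonal differences
```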
### Automatic ARIMA models
*auto.arima()* is a robust function which can handle non-stationary time series. The algorithm determines the type and number of operations needed to stationarize the data and applies them automatically.
```{r}
fit <- auto.arima(train,seasonal = T) # seasonal = F for non-Seasonal
fc_arima<- fit %>% forecast(h = 12)
autoplot(fc_arima)
```
Accuracy
```{r}
accuracy(fc_arima, data_ts)['Test set','MAPE']
```
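As with the earlier models, it is worth checking whether the residuals of the fitted ARIMA model resemble white noise:
```{r}
checkresiduals(fit) # Ljung-Box test and residual diagnostic plots
```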
### ARIMA models
We can also define customized ARIMA models by specifying the parameters order = (p,d,q) and seasonal = (P,D,Q). The parameters are found in the process of stationarizing the series (see the diagnostic sketch after the example below).
* p - number of lags (autoregressive part)
* d - number of differences needed
* q - number of error lags (moving average part)
* P - number of seasonal lags (seasonal autoregressive part)
* D - number of seasonal differences needed
* Q - number of seasonal error lags (seasonal moving average part)
```{r}
train %>% Arima(order = c(0, 1, 0),seasonal = c(0, 1, 0), include.constant = F) %>% forecast() %>% autoplot()
```
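In practice, the non-seasonal and seasonal orders are often read off the ACF and PACF plots of the differenced series; a quick sketch of those diagnostics:
```{r}
ggAcf(diff(train, lag = 12))  # spikes suggest the moving average orders q / Q
ggPacf(diff(train, lag = 12)) # spikes suggest the autoregressive orders p / P
```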
## Hybrid Method
In this section, we will see how we can combine econometric approaches and time series approaches. We will need a second variable at this stage.
### Dynamic regression with auto.arima
```{r}
data_ts_all <- ts(HistoricalData[,c(2,3)], frequency = 12)
data_ts_phoneTest <- data_ts_all[,"Phone"][37:48] # last 12 months, to be used later as future xreg
```
Let's plot them together
```{r}
autoplot(data_ts_all, facets = TRUE) # one panel per series
```
Fit the model
```{r}
fit <- auto.arima(data_ts_all[,'Email'], xreg = data_ts_all[,'Phone'], stationary = TRUE)
```
Checking the regression coefficient - a positive coefficient indicates a direct relationship.
```{r}
# check the regression coefficient
coefficients(fit)
```
Forecast
```{r}
fc_dynRegre <- forecast(fit, xreg = data_ts_phoneTest)
# Plot fc with x and y labels
autoplot(fc_dynRegre) + xlab("Month") + ylab("Email")
```
We can also assume nonlinear relationships between the variables.
```{r}
xreg <- cbind(Phone = data_ts_all[,'Phone'],
PhoneSquared = data_ts_all[,'Phone']^2,
PhoneCubed = data_ts_all[,'Phone']^3)
# Fit model
fit <- auto.arima(data_ts_all[, "Email"], xreg = xreg)
# Forecast fit 12 month ahead
fc_dynRegre_NL <- forecast(fit,
xreg = cbind(data_ts_phoneTest, data_ts_phoneTest^2, data_ts_phoneTest^3))
```
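As before, we can plot the resulting forecast:
```{r}
autoplot(fc_dynRegre_NL) + xlab("Month") + ylab("Email")
```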
### Dynamic harmonic regression - using Fourier terms
```{r}
# Set up harmonic regressors of order K
# K must not be greater than period/2
harmonics <- fourier(train, K = 1)
# Fit regression model with ARIMA errors
fit <- auto.arima(train, xreg = harmonics, seasonal = FALSE)
# Forecast the next 12 months
newharmonics <- fourier(train, K = 1, h = 12)
fc_fourier <- forecast(fit, xreg = newharmonics)
# Plot forecasts fc_fourier
autoplot(fc_fourier)
```
Accuracy
```{r}
accuracy(fc_fourier, data_ts)['Test set','MAPE']
```
### TBATS model
TBATS combines several of the components seen so far: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. It works well for time series with multiple seasonalities and a strong trend.
```{r}
fit <- tbats(train)
# Forecast the series for the next 12 months
fc_TBAT <- forecast(fit, h = 12)
# Plot the forecasts
autoplot(fc_TBAT)
```
Accuracy
```{r}
accuracy(fc_TBAT, data_ts)['Test set','MAPE']
```
# Conclusion
As we have seen above, complex approaches do not always give the best results. It is, therefore, necessary to try multiple methods and choose the one that works best for your problem. Forecasting is a process in which models are continually fine-tuned for better accuracy. R can definitely make much of the hard work that comes with this easier.
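One convenient way to run that comparison is to collect the test-set MAPE of each forecast object into a single sorted vector; a minimal sketch using the objects created above:
```{r}
mape <- function(fc) accuracy(fc, data_ts)['Test set', 'MAPE']
sort(c(naive   = mape(fc_naive),
       snaive  = mape(fc_snaive),
       hw_mult = mape(fc_ses_ts_m),
       ets     = mape(fc_ETS),
       arima   = mape(fc_arima),
       fourier = mape(fc_fourier),
       tbats   = mape(fc_TBAT)))
```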