---
title: "Workforce Management Forecasting with R"
author: "Tesfahun Tegene Boshe"
date: "03/03/2022"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    code_folding: show
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Forecasting is the first major stage of workforce management planning. WFM forecasters predict, among other things, the volume of conversations to expect in some future period. Forecasting future contacts is difficult because of the uncertainty in the many factors that drive contact volume, as well as the way contacts are distributed over time.
Time series data consist of observations of a variable at consecutive points in time, in chronological order. There are three broad approaches to time series forecasting.
**Time series models** - Time series models define the current value of a variable as a function of its own history; no other variable is involved.
**Econometric models** - Econometric models use additional explanatory variables to model the variation in the target variable.
**Hybrid models** - Hybrid models combine the strengths of both worlds.
While there is not a single best method for forecasting, there are widely used methods and best practices. We will discuss some of the most commonly used time series models and hybrid approaches in this blog. We will also cover the metrics which can indicate the accuracy of the forecast.
The *accuracy()* function from the *forecast* package reports the mean error (ME), root mean squared error (RMSE), mean absolute error (MAE), mean percentage error (MPE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), the first-order autocorrelation of the residuals (ACF1) and Theil's U.
MAPE is a popular accuracy measure because it is insensitive to the sign of the error and independent of the scale of the units. We will use MAPE to compare the accuracy of the methods discussed below.
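For reference, MAPE is simply the mean of the absolute percentage errors; a minimal sketch of the calculation, with hypothetical actuals and forecasts:
```{r}
actual <- c(100, 120, 90) # hypothetical actual volumes
fcst   <- c(110, 115, 95) # hypothetical forecasts
mean(abs((actual - fcst) / actual)) * 100 # MAPE in percent
```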
```{r, echo=F}
setwd("C:/Users/Tesfahun Boshe/Documents/WFM/WFM-forecasting-with-R")
```
## EDA
Loading the data
```{r}
HistoricalData <- read.csv("historicalData.csv")
```
### Data Description
The data we will use is historical monthly volume data from Jan 2019 to Dec 2022 for a hypothetical contact center, covering two channels: email and phone.
* Period - first day of the month representing the particular month
* Email - the number of email contacts
* Phone - the number of phone call contacts
```{r}
HistoricalData$Period <- as.Date(HistoricalData$Period,format = "%d/%m/%Y")
HistoricalData$Email <- as.numeric(HistoricalData$Email)
HistoricalData$Phone <- as.numeric(HistoricalData$Phone)
str(HistoricalData)
```
Our goal is to forecast the email contacts for the next 12 months. The phone contacts will be used as an additional variable when we discuss hybrid models.
Let's define the Email column as a time series object. The *ts()* function from the *stats* package does just that.
```{r}
data_ts <- ts(HistoricalData[,2], frequency = 12) # 12 since monthly data
```
### Time Series Plots
Time series plots help us identify outliers in the data; outliers need to be handled before modeling.
To plot the time series data, we will use the *fpp2* package.
```{r}
# install.packages("fpp2")
library(fpp2)
autoplot(data_ts) # time series plot
```
```{r}
ggseasonplot(data_ts,main = "Season Plot") # time series plot for each season
```
```{r}
ggseasonplot(data_ts, polar = T,main = "Polar Season Plot") # polar season plot
```
### Data partitioning
It is important to test your models on historical data. We will, therefore, divide our historical data into train and test samples. The true test data is of course unknown and lies in the future; however, testing on out-of-sample data gives you confidence that the model will be robust against the future reality as well.
```{r}
train <- subset(data_ts, end = length(data_ts) - 12) # 2019 through 2021
test <- subset(data_ts, start = length(data_ts) - 12 + 1) # 2022 data
```
# Methods
## 1. Naive Method / Random Walk without Drift
The *naive()* function from the *forecast* package can be used for naive forecasts. A naive forecast simply extends the last observed value across the forecast horizon.
```{r}
fc_naive <- naive(train, 12) # forecast 12 periods ahead
autoplot(fc_naive) # train, forecast together.
```
We can check the error band and the residuals.
```{r}
summary(fc_naive)
checkresiduals(fc_naive) # Do they look like white noise?
```
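For comparison with the methods below, we can also compute the test-set MAPE of the naive forecast:
```{r}
# accuracy of the naive forecast against the held-out test period
accuracy(fc_naive, data_ts)['Test set','MAPE']
```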
## 2. Seasonal Naive Method / Seasonal Random Walk
Instead of extending the very last observation of the train data, the seasonal naive method repeats, for each month, the corresponding value from the last observed season.
```{r}
fc_snaive <- snaive(train, 12)
autoplot(fc_snaive) # captures some seasonality in the data
```
```{r}
# error band and the residuals
checkresiduals(fc_snaive) # does the residual look like white noise?
summary(fc_snaive)
```
```{r}
#accuracy
accuracy(fc_snaive, data_ts)['Test set','MAPE']
```
## 3. Holt-Winters exponential smoothing (triple exponential smoothing)
The *ses()*, *holt()* and *hw()* functions from the *forecast* package can be used for the Holt-Winters family of exponential smoothing algorithms. We need to check whether or not our data has trend and seasonality to choose among them, or we can be lazy and use the *ets()* function from the same package, which works well in all cases.
a. Simple exponential smoothing - no trend or seasonality
```{r}
fc_ses <- ses(train, h = 12) # only for a stationary ts
```
```{r}
autoplot(fc_ses) # forecast plot
```
b. Holt's method - exponential smoothing with trend
```{r}
fc_ses_t <- holt(train,h = 12)
autoplot(fc_ses_t)
```
c. Holt-Winters method - exponential smoothing with trend and seasonality
```{r}
# Simple exponential smoothing with trend and seasonality
fc_ses_ts_m <- hw(train, seasonal = "multiplicative", h = 12)
fc_ses_ts_a <- hw(train, seasonal = "additive", h = 12)
autoplot(fc_ses_ts_m)
autoplot(fc_ses_ts_a)
```
Accuracy
```{r}
accuracy(fc_ses_ts_m, data_ts)['Test set','MAPE']
```
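The additive variant can be scored the same way, to check which seasonal specification fits this series better:
```{r}
accuracy(fc_ses_ts_a, data_ts)['Test set','MAPE']
```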
## 4. ETS - Error, Trend, Seasonality
*ets()* performs automatic forecasting with exponential smoothing, selecting the best combination of error, trend and seasonal components for a wide range of time series.
```{r}
fitets <- ets(train)
fc_ETS <- forecast(fitets,h = 12)
```
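We can inspect which ETS specification was selected:
```{r}
summary(fitets) # prints the chosen model, e.g. ETS(M,N,M), and its smoothing parameters
```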
Accuracy
```{r}
accuracy(fc_ETS, data_ts)['Test set','MAPE']
```
## 5. Box-Jenkins - ARIMA/SARIMA/SARIMAX
Box-Jenkins methods are popular choices for short-term forecasts. They should not be used beyond a few periods, since at longer horizons the forecast values flatten out to a constant.
Box-Jenkins models require a stationary time series. A stationary series has no trend or seasonality, and its mean and variance are constant over time. Therefore, it is important to remove those components from the time series, in other words to *stationarize the data.*
A signal can be decomposed into a DC and an AC component. (The DC component is also called the level; the AC component can further be decomposed into trend, seasonality and noise.)
*Trend* - the general long-term change in direction (up/down)
*Seasonality* - regular, repeating increases and decreases tied to the time of day, day of week, month of year, etc.
**Stationarity Tests**
The Dickey-Fuller test and the augmented Dickey-Fuller test are the most commonly used tests for stationarity. The KPSS test, the Phillips-Perron (PP) test and the Ljung-Box white-noise test can also be used.
```{r}
library(tseries)
adf.test(train) # augmented Dickey-Fuller test
```
> Interpretation: the p-value is not less than 0.05, so we fail to reject the null hypothesis of a unit root. This means the time series is non-stationary.
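The KPSS test mentioned above is also available in the *tseries* package. Note that its null hypothesis is the reverse of the ADF test's: the null is stationarity, so a small p-value suggests non-stationarity.
```{r}
kpss.test(train) # null hypothesis: the series is level-stationary
```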
**Data stationarization** - stabilizing the mean and variance of the series
a. Box-Cox transformations
Try various lambda values until the plot looks stationary.
```{r}
# Box-Cox transformations
train %>% BoxCox(lambda = .3) %>% autoplot()
train %>% BoxCox(lambda = -0.7) %>% autoplot()
BoxCox.lambda(train) # the best lambda
```
b. Differencing
```{r}
diff(train)
autoplot(diff(train)) # Plot of differenced data
```
The autocorrelation function plots can tell if there is remaining trend/seasonality.
```{r}
ggAcf(diff(train))
```
c. Seasonal differencing
```{r}
difftrain <- diff(train, lag = 12)
```
d. Logarithmic damping - used to reduce the variance (equivalent to a Box-Cox transformation with lambda = 0).
```{r}
log_train <- log(train)
```
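Rather than guessing how many regular or seasonal differences are needed, the *forecast* package can estimate them directly:
```{r}
ndiffs(train)  # suggested number of first differences
nsdiffs(train) # suggested number of seasonal differences
```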
### Automatic ARIMA models
*auto.arima()* is a robust function which can handle non-stationary time series. The algorithm determines the type and number of operations needed to stationarize the data and applies them automatically.
```{r}
fit <- auto.arima(train,seasonal = T) # seasonal = F for non-Seasonal
fc_arima<- fit %>% forecast(h = 12)
autoplot(fc_arima)
```
Accuracy
```{r}
accuracy(fc_arima, data_ts)['Test set','MAPE']
```
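As with the earlier models, it is worth checking whether the residuals of the fitted ARIMA model resemble white noise:
```{r}
checkresiduals(fit) # Ljung-Box test and residual diagnostic plots
```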
### ARIMA models
We can also define customized ARIMA models by specifying the parameters order = (p,d,q) and seasonal = (P,D,Q). The parameters are found in the process of stationarizing the series (see the diagnostic sketch after the example below).
* p - number of lags (autoregressive part)
* d - number of differences needed
* q - number of error lags (moving average part)
* P - number of seasonal lags (seasonal autoregressive part)
* D - number of seasonal differences needed
* Q - number of seasonal error lags (seasonal moving average part)
```{r}
train %>% Arima(order = c(0, 1, 0),seasonal = c(0, 1, 0), include.constant = F) %>% forecast() %>% autoplot()
```
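In practice, the non-seasonal and seasonal orders are often read off the ACF and PACF plots of the differenced series; a quick sketch of those diagnostics:
```{r}
ggAcf(diff(train, lag = 12))  # spikes suggest the moving average orders q / Q
ggPacf(diff(train, lag = 12)) # spikes suggest the autoregressive orders p / P
```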
## Hybrid Method
In this section, we will see how we can combine econometric approaches and time series approaches. We will need a second variable at this stage.
### Dynamic regression with auto.arima
```{r}
data_ts_all <- ts(HistoricalData[,c(2,3)], frequency = 12)
data_ts_phoneTest <- data_ts_all[,"Phone"][37:48] # last 12 months, to be used later as future xreg
```
Let's plot them together
```{r}
autoplot(data_ts_all, facets = TRUE) # one panel per series
```
Fit the model
```{r}
fit <- auto.arima(data_ts_all[,'Email'], xreg = data_ts_all[,'Phone'], stationary = TRUE)
```
Checking the regression coefficient - a positive coefficient indicates a direct relationship.
```{r}
# check the regression coefficient
coefficients(fit)
```
Forecast
```{r}
fc_dynRegre <- forecast(fit, xreg = data_ts_phoneTest)
# Plot fc with x and y labels
autoplot(fc_dynRegre) + xlab("Month") + ylab("Email")
```
We can also assume nonlinear relationships between the variables.
```{r}
xreg <- cbind(Phone = data_ts_all[,'Phone'],
PhoneSquared = data_ts_all[,'Phone']^2,
PhoneCubed = data_ts_all[,'Phone']^3)
# Fit model
fit <- auto.arima(data_ts_all[, "Email"], xreg = xreg)
# Forecast fit 12 month ahead
fc_dynRegre_NL <- forecast(fit,
xreg = cbind(data_ts_phoneTest, data_ts_phoneTest^2, data_ts_phoneTest^3))
```
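As before, we can plot the resulting forecast:
```{r}
autoplot(fc_dynRegre_NL) + xlab("Month") + ylab("Email")
```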
### Dynamic harmonic regression - using Fourier terms
```{r}
# Set up harmonic regressors of order K
# K must not be greater than period/2
harmonics <- fourier(train, K = 1)
# Fit regression model with ARIMA errors
fit <- auto.arima(train, xreg = harmonics, seasonal = FALSE)
# Forecast the next 12 months
newharmonics <- fourier(train, K = 1, h = 12)
fc_fourier <- forecast(fit, xreg = newharmonics)
# Plot forecasts fc_fourier
autoplot(fc_fourier)
```
Accuracy
```{r}
accuracy(fc_fourier, data_ts)['Test set','MAPE']
```
### TBATS model
TBATS combines several of the components seen so far: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. It works well for time series with multiple seasonalities and a strong trend.
```{r}
fit <- tbats(train)
# Forecast the series for the next 12 months
fc_TBAT <- forecast(fit, h = 12)
# Plot the forecasts
autoplot(fc_TBAT)
```
Accuracy
```{r}
accuracy(fc_TBAT, data_ts)['Test set','MAPE']
```
# Conclusion
As we have seen above, complex approaches do not always give the best results. It is, therefore, necessary to try multiple methods and choose the one that works best for your problem. Forecasting is a process in which models are continually fine-tuned for better accuracy. R can definitely make much of the hard work that comes with this easier.
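One convenient way to run that comparison is to collect the test-set MAPE of each forecast object into a single sorted vector; a minimal sketch using the objects created above:
```{r}
mape <- function(fc) accuracy(fc, data_ts)['Test set', 'MAPE']
sort(c(naive   = mape(fc_naive),
       snaive  = mape(fc_snaive),
       hw_mult = mape(fc_ses_ts_m),
       ets     = mape(fc_ETS),
       arima   = mape(fc_arima),
       fourier = mape(fc_fourier),
       tbats   = mape(fc_TBAT)))
```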